It frequently gives outdated answers that violate the standard of care, and there is no way to know which sentences are the wrong ones. The result is that every fact must be searched again manually anyway. There is no time saving, and taken on its own, the advice is literally criminal.
She still uses it for tasks like rewriting emails for tone, but not for anything involving medical knowledge.
This should not be surprising to anyone who has used these tools. Tasks where users can self-validate and riff, like writing and coding, work great. But medical knowledge Q&A would need something like links back to current sources for every tidbit before it could be in that camp.
Never mind complicated fields like oncology; ChatGPT gets even basic medical questions wrong, like treatment for minor sports injuries. If you ask it whether ice is an effective treatment, it will basically answer yes and give a long list of reasons why. This is unsurprising, since so many web pages still recommend the outdated RICE (rest, ice, compression, elevation) protocol. But it turns out that protocol was never based on reliable scientific evidence and in most cases likely does more harm than good.
ChatGPT frequently gets animal classification wrong. There is an example made famous on Reddit about how it insists that the sailfish is a mammal. I could reproduce it this afternoon by simply asking it: "What is the fastest sea mammal?". It is totally confident that, although sailfish hatch from eggs, their offspring lick milk from their mother's skin (WTF?)
I've been playing with other questions about animal classification (e.g., "give me a list of venomous animals in the class Amphibia") and it often butchers them.
Based on these observations, I'd bet the house that no, large language models cannot under any circumstance reason about medical questions at present.
> The fastest sea mammal is the dolphin. Dolphins are known for their speed and agility in the water and can swim at speeds of up to 45 miles per hour. They are also excellent jumpers, often leaping out of the water as they swim. In addition to their speed, dolphins are also highly intelligent and social animals, known for their playful behavior and vocalizations. Other fast-swimming sea mammals include whales, such as the killer whale (also known as the orca), which can swim at speeds of up to 30 miles per hour, and certain species of sharks, such as the shortfin mako shark and the great white shark, which can swim at speeds of up to 60 miles per hour.
It also knew about the sailfish when asked directly:
> Is a sailfish a mammal?
> No, a sailfish is not a mammal. It is a fish that belongs to the genus Istiophorus of the billfish family.
I wonder if there were incorrect statements in the context that biased it.
It doesn't always give consistent answers to the same question. Here it is roughly 15 minutes after your attempt:
what is the fastest sea mammal?
The fastest sea mammal is the sailfish.
It is a species of billfish that can swim at speeds of up to 68 miles per hour (109 km/h).
[...]
Other fast-swimming sea mammals include the dolphin and the marlin, which can swim at speeds of up to 50 miles per hour (80 km/h).
Interesting that in this case the capitalization of the "w" seems to make a big difference. I ran it a few times with a capital W and a lowercase w: it said sailfish for the lowercase w most of the time and switched between an orca and a dolphin for the capital W. I wonder if all the training questions with good answers in the fine-tuning set were capitalized.
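If anyone wants to quantify that, here is a rough way to sample both capitalizations repeatedly through the API (this assumes the openai Python client and an API-served model; the hosted ChatGPT UI has no such hook, and counts will vary run to run):

    from collections import Counter
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def sample(prompt, n=10):
        answers = Counter()
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
            # tally the first sentence of each reply
            answers[resp.choices[0].message.content.split(".")[0]] += 1
        return answers

    print(sample("what is the fastest sea mammal?"))
    print(sample("What is the fastest sea mammal?"))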
I asked about the intersection of animals and health by describing the kind of lentigo typical in orange cats (without naming it) and asking if it was dangerous. It jumped to cancer as its first diagnosis, but I guess that puts it on par with WebMD.
There is probably a paper there: Which is more accurate at reasoning about a set of symptoms? WebMD, or ChatGPT? A medical doctor would produce the original set of questions and evaluate the responses. A statistically average set of people would be given the set of symptoms and then asked to use either WebMD or ChatGPT to figure out what the disease is.
Some people would claim that this is a special case, and if large language models can save more lives statistically speaking than human doctors, then that's a win.
It is hard to overstate just how scarce even median-doctor-level competency is at the global scale.
If large language models indeed can give billions access to mostly-reliable medical diagnosis ... it will be a huge humanitarian win.
Honestly, I hoped that this company https://www.humandx.org/ would make it real, but at some point they apparently pivoted to providing app-based quiz training for doctors.
Maybe just taking an off-the-shelf instruct-tuned language model and fine-tuning it on an expert-curated corpus of doctor-patient dialogue is the right way of approaching this. Someone will do it; it's not even as costly as people imagine.
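A minimal sketch of what that tuning step could look like with Hugging Face transformers; the jsonl file, its "dialogue" field, and the gpt2 base are stand-ins for whatever instruct-tuned model and expert-curated corpus you would actually use:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "gpt2"  # placeholder; in practice an instruct-tuned model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # hypothetical file: one JSON object per line with a "dialogue" field
    ds = load_dataset("json", data_files="doctor_patient_dialogues.jsonl")["train"]
    ds = ds.map(lambda ex: tok(ex["dialogue"], truncation=True, max_length=512),
                remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="med-dialogue-ft",
                               num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()

The expensive part is curating the corpus and evaluating the result against clinicians, not the compute.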
Computer aided diagnostic software has been around for decades. It isn't very useful in improving medical outcomes.
Diagnosis isn't particularly hard in routine cases. The hard part is in gathering relevant data and entering it into a computer. Some parts of that data gathering can be automated to an extent, but in general it remains an unsolved problem. For example, large language models can't collect a useful patient history.
I agree with the fundamental data/measurement bottleneck. We should likely concentrate our effort on scaling and democratizing this physical side of the equation.
Still, given the apparent usefulness of basic unaided visual analysis https://www.ncbi.nlm.nih.gov/books/NBK330/ I expect the upcoming multimodal language models to provide a tangible enhancement to diagnostic accuracy, if we allow them to see photos of the patient.
I’m on a number of illness support forums. It is shocking how often doctors are flat-out wrong: they refuse to run tests or ignore positive tests. I constantly see people post test results that are a clear positive while the doctor says nothing is wrong.
This is on top of the fact that some diagnoses can take months to multiple years.
Of course it goes both ways. A number of conditions can often be managed by diet alone, yet some people will go ballistic if you suggest they try this.
Umm, you have three months until your next appointment. Why not try eating healthy food in the meantime?
I’m not going to dispute this statement; however, it's not clear how you can tell when that's the case. The only way seems to be to consult multiple doctors and take a majority vote. Is that how your illness support forums operate?
People post what their symptoms are, what tests they have had, and what doctors have said.
Other people make suggestions about what conditions to investigate, what tests to run next, and what sort of doctor to see next.
With something like Sjogren's, doctors tend to be very out of date on the testing criteria. So often a patient has every symptom, but the doctor is highly dismissive of it.
Or a test with a high false-negative rate comes back negative and the doctor immediately stops testing, even though better tests exist.
Another example: when I see someone complaining about double vision, they need to go get a spinal tap to measure pressure. Not a lot of doctors realize that.
Retired neurosurgical anesthesiologist (38 years experience) here.
It would never occur to me (or the neurosurgeons I worked with) to order a spinal tap for a patient presenting with double vision. If cerebrospinal fluid pressure were elevated, a spinal tap could cause acute brain stem herniation and sudden death.
The most likely near-term way LLMs could help would be generating hypotheses about illness and treatment which would then be considered by a doctor who can understand when the LLM is mammalizing fish so to speak.
> Based on these observations, I'd bet the house that no, large language models cannot under any circumstance reason about medical questions at present.
I think that would be a bad bet, unless you're leaning heavily on the words "at present." I don't see any fundamental reason why language models can't reason about any particular thing, provided they're taught about that thing.
"Here is an outline of a quantum algorithm that can be used to sort a list of integers:
"Initialize a quantum register with the input list of integers. Each integer in the list can be represented as a qubit (quantum bit) in the quantum register.
"Use a quantum circuit to apply a quantum Fourier transform (QFT) to the quantum register. The QFT is a quantum algorithm that can be used to perform a transformation on the quantum register that maps the integers in the list to their corresponding frequencies in the frequency domain.
"Use a classical computer to measure the output of the QFT. This will result in a list of integers in the frequency domain, which can be sorted using a classical sorting algorithm.
"Use a quantum circuit to apply an inverse QFT to the sorted list of integers in the frequency domain. This will transform the list back to the original domain, resulting in a sorted list of integers.
"This is just a high-level overview of the algorithm, and there are many details and considerations that would need to be addressed in order to fully implement it. However, I hope this gives you a general idea of how a quantum algorithm can be used to sort a list of integers."
ChatGPT has been taught about sailfish - it has had access to a corpus of several centuries worth of knowledge regarding sailfish - and it will try to persuade you that it is a mammal, making up some outlandish facts along the way.
Yea, I lean on "at present" because the question in the paper is whether LLMs (GPT 3.5 specifically) can reason about medicine now.
If gradient-descent in DNNs indeed has an intrinsic bias for short and simple solutions, like this paper https://arxiv.org/abs/2006.15191 shows, and if the "reasoning" is among the shortest solutions to match the textual data with the language modeling loss, then at some limit of model+data scaling the model has to recover this solution. Or maybe even something better, "reasoning++"?
It's all too easy to dismiss the success of infinitely plastic, scalable systems.
"For every complex problem there is an answer that is clear, simple, and wrong." Possibly said by H. L. Mencken.
Also, "The fastest sea mammal is the sailfish, ... I am confident that the information I provided about the fastest sea mammal is accurate. The data on the speeds of different marine animals is widely available and has been widely studied and reported on by scientists and other experts."
Then you need to provide a definition for "learn" and "reason" and an argument as to why. These models are trained to do the task they do, which is, by definition, learning. I'll leave "reasoning" up to you, because that's a fuzzier question, but I still would like to see what your fundamental objection is.
> Based on these observations, I'd bet the house that no, large language models cannot under any circumstance reason about medical questions at present.
Did you read the paper? You already lost this bet. And even if this paper didn’t exist, it would be a very naive and short-sighted prediction. We’re just at the start of understanding what LLMs can achieve.
> Based on these observations, I'd bet the house that no, large language models cannot under any circumstance reason about medical questions at present.
Wtf? It just did. Have you read the paper? Wake up.
It's unclear where in the paper they demonstrate, unequivocally, that reasoning occurred rather than pure statistical model-fitting (we generally believe this is one of the key differences between AGI and modern LLMs).
Reasoning also takes into account much more contextual information. I think many would also say it's a step-by-step process in which the explicit assumptions and conclusions could be stated clearly, although I think that's probably just anthropomorphism and projection.
It gives me the correct answer (as judged by experts) to a complex medical problem. If you want to keep saying this isn't reasoning, be my guest. But IMO that's irrelevant.
Spitting out the results of a bunch of experts is the opposite of reasoning. Or more correctly, it's one limited form of reasoning: an ensemble prediction made from a mixture of experts. That's not an interesting form of reasoning because literally all the decisions are encoded in the weights you apply to the experts.
> This paper is not even using GPT-3.5, it's using PaLM.
Where are you getting this?
From the excerpt:
"We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions"
I am going to keep repeating that ChatGPT cannot correctly classify common animals as many times as needed. I've been using the app daily, and it's quite frequently factually wrong about many other topics too (for example, programming or history).
Elsewhere in this thread you were rude to me, even though I made an intelligent comment in good faith. Based on your current comment (arguing with people over technical details), may I suggest you back off a bit and not be so aggressive about "being right"?
Sorry, I did think you were not arguing in good faith. But if you are, then let me ask you: if you believe that reaching an expert's opinion is not evidence of reasoning, then what are medical students doing for six years of school?
There's an enormous amount of memorization of details. But, they also learn to reason... and explain their reasoning.
More importantly, simply regurgitating what experts say (within-class performance) isn't interesting because it's not going to convince anybody that it's going to make good decisions.
I've been using ChatGPT for basic medical questions that I've been having a hard time getting details on with google. So far the responses I've been getting have been quite informative with sound reasoning.
I'm just using it as another tool. I'm not taking the responses at face value, but instead using them as a guide to help me do more research elsewhere.
Yes, I've discussed things with a doctor as well, but if the symptoms are too general, I don't get very far with that approach.
Given that large language models don't have any actual knowledge (in a human sense) of the data they've been trained on other than raw statistics, can they be said to "reason" about anything?
How sure are we that knowledge in humans is fundamentally different from a language model? How do you generally represent knowledge? Graphs? Patterns with typed placeholders? Could those structures be embedded into a language model?
>How sure are we that knowledge in humans is fundamentally different from a language model?
Because humans learn about language in real world contexts, accompanied by multiple sensory streams. LLMs learn about it solely by being exposed to text. Imagine a child kept in a closet who learns about the world solely by reading books, with no indication which are fact and which are fiction, and no exposure to the physical world outside. LLMs cannot reason, only regurgitate probabilistically.
What does it mean to grok semantics, and how would you test this?
I am not trying to argue that language models are close to human-level reasoning or whatever, but it is not obvious to me that a language model of some sort is fundamentally unable to achieve this.
The question of whether language models can "reason" ultimately comes back to the Chinese room argument. There aren't any universally accepted resolutions to that argument either, nor to the question of whether those limitations (if they exist) are fundamental.
However, the current incarnation of LLMs like ChatGPT fail at straightforward reasoning around things like categorizing sailfish and mathematical proofs. I'm incredibly impressed at how powerful the simplistic methods they use are, but the deficiencies make it clear that it's not abstractly reasoning from definitions in the same way humans are able to.
A fun thing to do: ask ChatGPT a question, and ask for references. Then try to track down the references. I just asked a simple question about the Thirty Years War, and with the answer it gave me two valid references (including C. V. Wedgwood, yay) and one that was completely made up.
As far as I understand, LLMs have no model of reality to base results on; they're just an enormously complex statistical model of "what is the next word in the series".
> is this all just a game of syntax that _feels_ like semantics from the inside?
The point is that in the case of LLMs there is no mystery here: it is syntactic templates × statistics. But do we know that 'syntax' is all there is to our mental reasoning? We don't know this (yet). I have flirted with the notion, heretical for me since I 'believe' in the primacy of meaning, not form, and can kinda squint and see meaning possibly being a 'virtual' phenomenon, possibly simply a 'narrative' of 'arcs' connecting 'facts' mapped to a syntactic template. That meaning is simply a process. [So, not even "42" ..]
Could be. It does not even have to feel like anything; this could be an independent additional component. This goes in the direction of the Chinese room: how do you figure out if someone understands something? I would say you ask questions and judge by the answers.
The knowledge is stored in the model's weights in the same way the knowledge you possess is stored in your neurons. There is not much difference, honestly.
My understanding is that it basically interpolates between the most frequent sequences of words, based on context. There is no concept of rational reasoning underpinning the words. I don't think it is comparable at this stage. It is just something very good at throwing back at you segments of text it has found on the internet.
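Concretely, "predicting the next word" is nothing more than producing a probability distribution over possible next tokens and sampling from it; a minimal sketch of what that looks like (gpt2 as a small stand-in, via Hugging Face transformers):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The fastest sea mammal is the", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next token only
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=5)
    for p, i in zip(top.values, top.indices):
        print(f"{tok.decode([int(i)])!r}: {float(p):.3f}")

Every sentence it produces is sampled token by token from distributions like this one; there is no separate fact-checking step.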
What you are probably missing is that for the model to be competent at predicting the next word IT HAS to learn how to reason as a human being would.
I don't see why this is. Humans aren't trained to perform a statistical analysis of what the next word is likely to be. We learn language from other humans in social and environmental context. The LLM is just being trained on text.
Nobody is saying these models are trained the same way humans are. I am just saying that during their training task (predicting the next word) they end up developing human reasoning. That of course wasn't obvious before these large models appeared. But now it's pretty clear that's the case. It wouldn't be able to perform so well in so many tasks if it didn't have something resembling reasoning capabilities.
I asked ChatGPT about the Chinese Room and asked for references. The first it came up with was "Minds, Brains, and Programs" by John Searle (1980).
The next two were "The Chinese Room: Searle and Artificial Intelligence" edited by John Preston and Mark Bishop (2002) (Preston and Bishop did publish Views into the Chinese Room in 2002, but I can't find anything by the given title) and "The Chinese Room Argument" by David Chalmers (2006) (Chalmers has written on the topic, and published three articles in 2006, but that's not one of them).
Also:
"Two cities: one, a city of “knights”, and the other of “knaves”. The residents of the two cities like to visit each other. One day, a man enters one of the cities, but doesn’t know which. He decides to ask the first person he meets where he is. The answer he receives is that he is in the “knaves” city."
produced the result:
"The man cannot be sure that the person he met was telling the truth, because the person could be a knight from the city of knights, who always lie, or a knave from the city of knaves, who always tell the truth.
"To determine which city he is in, the man could ask the second person he meets the same question. If the first person was a knight, then the second person would also be a knight, and would lie about being in the city of knights. If the first person was a knave, then the second person would also be a knave, and would tell the truth about being in the city of knaves.
"By asking the same question to multiple people and analyzing their responses, the man can eventually determine which city he is in. This problem, known as the "Knights and Knaves" puzzle, is a classic example of logical reasoning and problem-solving."
It seems to recognize the "knights and knaves" form of puzzle, but doesn't get the knights and knaves right. And it's not getting the reasoning part at all even given varying prompts.
I don't understand what point you are making. Are you saying that it lacks any reasoning capability because it fails this test? I say try a bunch more tests and then get the average to really have a better sense; one test is meaningless.
How many opportunities would you give someone who, when asked to justify their answer, routinely made up references, and who ignored important parts of a question and whose answer wouldn't work anyway?
As it turns out, we kinda do know what is going on in an LLM, and there's no model of reality in there for it to base reasoning on.
Does that mean that's impossible? No. Does it mean seeing reasoning there is making the same mistake as those who enjoyed chatting with Parry and Eliza? Probably.
I hear this type of opinion expressed frequently by coders. I use ChatGPT pretty much every day (to help with various tasks) and there's just no way to hold on to this kind of belief after you get to "know" it at that level. It clearly can reason, maybe not in exactly the same way as a person. It is superior to the average person just in terms of following a logical argument in many cases. Maybe that is a low bar but I can't help but think that it is much higher than you think.
Try asking it one of Ray Smullyan's logic puzzles, like "There are two cities: one, a city of “knights” who always tell the truth, and the other of “knaves” who always lie. The residents of the two cities like to visit each other.
One day, a man enters one of the cities, but doesn’t know which. What question should he ask to find out?"
I had to add the part about the behavior of knights and knaves, because ChatGPT seems to recognize the puzzle but gets them wrong, and it seems to be completely ignoring the visiting behavior. Further, the answer doesn't really work.
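For reference, the standard answer to this form of the puzzle is to ask anyone "Do you live in this city?", and that can be brute-force checked in a few lines (a quick sketch, assuming knights live in the knight city and knaves in the knave city, with either type free to visit the other):

    # Knights always tell the truth; knaves always lie.
    for city in ("knight city", "knave city"):
        for speaker in ("knight", "knave"):
            lives_here = (speaker == "knight") == (city == "knight city")
            # a knight reports the truth, a knave reports its negation
            answer = lives_here if speaker == "knight" else not lives_here
            print(f"in the {city}, a {speaker} answers {'yes' if answer else 'no'}")

Both possible speakers answer "yes" in the knight city and "no" in the knave city, so the one question settles it, which is exactly the step-by-step justification ChatGPT never produced for me.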
Interpolation is a leaky analogy. With scale these models acquire increasingly sparse discrete-like activation patterns, and there are techniques[1] amounting to pinpointing the exact locations of the knowledge and updating it freely.
It seems that nonlinear interpolation in a rich enough latent space can give you a sufficiently good approximation to reasoning.
This must be about the 50th time I've seen the exact same comment with the exact same response. It's getting very dull. At some point somebody will say "it gets stuff confidently wrong all the time" and someone else will say "so just like people do?"
To call the model "just statistics" is meaningless. To call it "just like a brain" is unsupported. There are better ways to debate this.
Not exactly unsupported. There are a bunch of papers showing a correlation between patterns of brain activation and the activations inside an AI model when they are given the same task or stimuli, and this correlation increases with the AI model's accuracy. Doesn't that suggest that AI models are partly mimicking the same human reasoning pathways?
This is an open and active question, but there really isn't any literature that's strongly trusted to show a truly deep relationship between how current machine learning models operate (at the mathematical level) and how brains work. The suggestion is there, no doubt, but it's unclear where that suggestion will ultimately lead. There certainly has been a good deal of healthy cross-fertilization between neuroscience and computer science.
It seems safe to say that ML has been moving further and further away from what neuroscience suggests, and towards efficient execution on fast GPUs and TPUs. That's in part because neuroscience is currently not equipped to explain how human high-level intelligence works.
Agreed! I think one useful direction to take this debate is to admit that in order for you or me to interact with the tool we need to have a mental model of the tool (so we have some idea of how to interact with it).
So which mental model is superior when evaluated under the criterion of "ability to interact productively with the tool"? (a) It's just statistics / stringing together phrases from the internet or (b) it's more like a human brain with vastly superior general knowledge?
Personally, I would put money on (b) resulting in better interactions. This is probably something we could study empirically.
But ultimately I don't really care that much about the philosophical issues, my mental model of it (a more nuanced version of b) has enabled me to use it productively almost every day to do things that I couldn't do before.
Is that what neuroscientists and machine learning people are saying? Because I've heard there are significant differences between actual neurons and digital neural networks. Anyway, I think analyzing knowledge at the level of neurons misses the bigger picture. We learn from interacting with the world and other people via our bodies, not from vast amounts of textual tokens.
See my comment in this thread about how ChatGPT thinks that sailfish are mammals. I possibly have never read the classification of sailfish explicitly in a book or a documentary, but I could have classified it as a fish any time of day. For one it's in the name, and furthermore I can imagine with high probability that baby sailfish do not drink milk from their mothers.
I don't know if the knowledge in an LLM is stored like in my neurons - I know nothing about neurobiology. But after having used ChatGPT extensively I can certainly tell that ChatGPT does not mobilize knowledge like humans.
It's definitely missing a "module" that humans have: something like an ability to know and admit when it's not certain about something. The way I think of it is: imagine if you were constitutionally unable to ever admit that you didn't know something and were forced to keep talking about the topic. You too would have no choice in that situation but to start spouting nonsense. Once we add this model of uncertainty, I think this type of problem will go away and it will say something like "I'm not sure, but ...".
>The knowledge is stored in the model's weights in the same way the knowledge you possess is stored in your neurons.
We don't know that this is true. But even if it is, there is a critical difference in the knowledge stored. Human knowledge incorporates direct experience of the world through multiple senses, while an LLM's knowledge is solely derived from text and so misses some critical physical constraints and contexts.
> The knowledge is stored in the model's weights in the same way the knowledge you possess is stored in your neurons. There is not much difference, honestly.
Do we know this to be true? I guess I'm not aware that we understand very well how memories are stored/recalled in a human brain.
It's a silly question to ask, as the models can't reason about anything and know nothing about the world. They just return statistically relevant output for a given input, and mash the output together in such a way that there is no way to know whether there ever was a corresponding human-written source or whether it was stitched together from unrelated inputs, making it absolutely false.
Based on threads like yesterday's https://news.ycombinator.com/item?id=34172092 , some vocal members of the HN crowd might prefer the opinion of "ChatGPTMD" to a board certified US physician (and perhaps with good reason in some cases).
Definitely. I've sometimes dreamed of leaving clinical medicine to pursue a career helping integrate some "smart" helpers into EHR (ones designed to help physicians and patients instead of just billers and coders).
With 10 minutes and scikit-learn I can get pretty good test-set accuracy for length of stay based on pretty bare bones input (time of day, mode of arrival, age / sex, bigram / trigrams of chief complaint). Given access to vitals, labs, historical diagnoses, a provider's (and their colleagues') historical practices, and a little PyTorch / HF transformers I bet there would be some low hanging fruit for suggested diagnoses and quick pick orders. Could in some cases streamline getting to the same end result, in others possibly make recommendations to bring practice more in line with current recommendations or to prompt investigation into uncommon presentations or rare diseases.
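For the curious, here is roughly the shape of that quick model; a minimal sketch, assuming a hypothetical CSV export with these column names (placeholders, not a real dataset):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # hypothetical export: hour_of_day, arrival_mode, age, sex,
    # chief_complaint, length_of_stay_hours
    df = pd.read_csv("ed_visits.csv")
    X = df.drop(columns="length_of_stay_hours")
    y = df["length_of_stay_hours"]

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["arrival_mode", "sex"]),
        ("complaint", TfidfVectorizer(analyzer="word", ngram_range=(2, 3)), "chief_complaint"),
    ], remainder="passthrough")  # numeric columns (age, hour_of_day) pass through

    model = Pipeline([("pre", pre), ("reg", GradientBoostingRegressor())])
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model.fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))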
And certainly would have some sharp edges and potential harm too. But based on HN reports of experiences with US medicine, seems like the potential upside may be worthwhile.
Go ahead and do it. The SMART on FHIR platform is now supported by most major EHRs and should make it fairly straightforward to write that app. If the data elements you listed have been entered into the patient's chart then they should be available through the API.
The hard part is demonstrating the value of such apps to provider organizations, and getting them to change their workflow. Clinicians are extremely busy and don't want to click more buttons unless it improves patient care (and they get paid for it).
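To make the first paragraph concrete, pulling the data elements mentioned above from a FHIR server looks roughly like this; a quick sketch where the base URL, bearer token, and patient id are all placeholders:

    import requests

    FHIR_BASE = "https://ehr.example.com/fhir"   # hypothetical SMART on FHIR endpoint
    HEADERS = {"Authorization": "Bearer <access-token>",
               "Accept": "application/fhir+json"}

    def get(resource, **params):
        r = requests.get(f"{FHIR_BASE}/{resource}", headers=HEADERS, params=params)
        r.raise_for_status()
        return r.json()

    pid = "example-patient-id"
    patient = get(f"Patient/{pid}")                                # age / sex
    vitals = get("Observation", patient=pid, category="vital-signs",
                 _sort="-date", _count=20)
    labs = get("Observation", patient=pid, category="laboratory")
    diagnoses = get("Condition", patient=pid)                      # historical diagnoses

The authorization step (SMART app launch, scopes) is the part that varies most between EHR vendors, so the token handling here is hand-waved.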
Pre-internet, an MD's main job was to use their memory to comb through an immense body of knowledge and find the best-fitting explanation for a patient's complaints. A pretty hard task, mostly suitable for very smart people.
Now, with the internet, it is a process waiting to be disrupted. I would say it is already overdue.
As you said, this is a milestone long overdue. The next frontier of transformative success in medicine should come from finding novel, effective treatments for common chronic diseases.
We need to harness AI/AGI in all possible capacities to help us approach this goal.
The most common chronic diseases can be prevented effectively through proper diet, physical activity, and avoidance of toxins (substance abuse). This is not novel, it has been known for a long time. (Of course there are a minority of patients who just have bad genetics and will suffer from chronic diseases regardless of their lifestyle choices.)
AI technologies are already being applied to some phases of the drug development process. But the low-hanging fruit has mostly already been picked. The odds are low of AI ever finding a miracle drug that mimics the effects of good diet and exercise. The main bottleneck in drug development is phase-3 clinical trials, and AI can't help much with that.
As for AGI, we're not making any visible progress towards that goal. So don't count on it being available in our lifetimes.
> Now, with the internet, it is a process waiting to be disrupted. I would say it is already overdue.
Why is everything in this space about disruption?
Doctors use the Internet all the time. One of my kids is a resident in internal medicine and uses it regularly to find information. Example: videos of medical procedures.
ChatGPT and LLMs in general look like a great extension to existing search. But it still needs somebody to keep an eye on it to ensure it's not confidently spouting bullshit. This looks like something that needs to be carefully controlled (read: regulated) if you want to avoid very serious consequences for unlucky patients.
ML-based classifiers should fare better than both.
Too bad research on heterogeneous AI dried up when deep learning became popular, because that's exactly the kind of problem where logical classifiers and ML classifiers together should get the best results.
I mean, if you already have the medical knowledge and you want to automate things, wouldn’t codifying it into an expert system (fancy words for a bunch of if-else’s) make more sense than feeding examples to an ML classifier?
Obviously easier said than done, but in the case of medicine, I think we pretty much know why we make the decisions we do (versus, say, face recognition or text generation). Also, high confidence in the predictions and explainability are much more important in medicine.
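For concreteness, "a bunch of if-else's" means something like the toy rule set below; the thresholds and advice are entirely made up for illustration, not clinical guidance:

    # Toy expert-system fragment: explicit, auditable rules.
    def triage(temp_c: float, heart_rate: int, chest_pain: bool) -> str:
        if chest_pain and heart_rate > 120:
            return "urgent: rule out cardiac event"
        if temp_c >= 39.5:
            return "high fever: see a clinician today"
        if temp_c >= 38.0:
            return "fever: monitor and recheck in 24h"
        return "no rule fired: defer to a clinician"

    print(triage(38.6, 90, False))  # -> "fever: monitor and recheck in 24h"

Each decision can be traced back to a named rule, which is where the explainability argument comes from.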
> but in the case of medicine, I think we pretty much know why we make the decisions we do
We know why we make many of them, but not all. I'd guess not even most. Image diagnostics, lab analysis: there are plenty of areas where doctors make fuzzy decisions based on a lot of intuition.
Logic classifiers (a superclass of expert systems) are good for when we know why we decide stuff. ML is good for when we don't.
And yes, once we have an AI that does something well without us knowing why, the next obvious step is to reverse engineer it.
Wouldn't it help to feed the model thousands of doctors' reports and diagnoses? Wouldn't it then be able to predict probable responses based on the data given in the prompt?