AI & Humanoids

ChatGPT is as (in)accurate at diagnosis as ‘Dr Google’

ChatGPT's diagnostic capabilities are limited (Image: DALL-E)

ChatGPT is mediocre at diagnosing medical conditions, getting it right only 49% of the time, according to a new study. The researchers say their findings show that AI shouldn’t be the sole source of medical information and highlight the importance of maintaining the human element in healthcare.

The convenience of access to online technology has meant that some people bypass seeing a medical professional, choosing to google their symptoms instead. While being proactive about one’s health is not a bad thing, ‘Dr Google’ is just not that accurate. A 2020 Australian study looking at 36 international mobile and web-based symptom checkers found that a correct diagnosis was listed first only 36% of the time.

Surely, AI has improved since 2020. Yes, it definitely has. OpenAI’s ChatGPT has progressed in leaps and bounds – it’s able to pass the US Medical Licensing Exam, after all. But does that make it better than Dr Google in terms of diagnostic accuracy? That’s the question that researchers from Western University in Canada sought to answer in a new study.

Using ChatGPT 3.5, a large language model (LLM) trained on a massive dataset of over 400 billion words drawn from internet sources including books, articles, and websites, the researchers conducted a qualitative analysis of the medical information the chatbot provided by having it answer Medscape Case Challenges.

Medscape Case Challenges are complex clinical cases that test a medical professional’s knowledge and diagnostic skills. Medical professionals are required to make a diagnosis or choose an appropriate treatment plan for a case by selecting from four multiple-choice answers. The researchers chose Medscape’s Case Challenges because they’re open-source and freely accessible. To rule out the possibility that ChatGPT had prior knowledge of the cases, only those authored after the model’s August 2021 training cutoff were included.

A total of 150 Medscape cases were analyzed. With four multiple-choice responses per case, that meant there were 600 possible answers in total, with only one correct answer per case. The analyzed cases covered a wide range of medical problems, with titles like "Beer, Aspirin Worsen Nasal Issues in a 35-Year-Old With Asthma", "Gastro Case Challenge: A 33-Year-Old Man Who Can’t Swallow His Own Saliva", "A 27-Year-Old Woman With Constant Headache Too Tired To Party", "Pediatric Case Challenge: A 7-Year-Old Boy With a Limp and Obesity Who Fell in the Street", and "An Accountant Who Loves Aerobics With Hiccups and Incoordination". Cases with visual assets, like clinical images, medical photography, and graphs, were excluded.

An example of a standardized prompt fed to ChatGPT (Image: Hadi et al.)

To ensure consistency in the input provided to ChatGPT, each case challenge was turned into one standardized prompt, including a script of the output the chatbot was to provide. All cases were evaluated by at least two independent raters (medical trainees) who were blinded to each other’s responses. They assessed ChatGPT’s responses based on diagnostic accuracy, cognitive load (that is, the complexity and clarity of the information provided, from low to high), and quality of medical information (including whether it was complete and relevant).
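
As an illustration of what that standardization might look like, here is a minimal Python sketch of turning one case into a single prompt. The study’s exact prompt script isn’t reproduced in this article, so the wording, field names, and scripted output format below are assumptions, not the researchers’ actual prompt:

```python
# Illustrative sketch only: the study's real prompt wording is not public
# in this article, so this structure is an assumption about what the
# standardization involved.

def build_prompt(case_description: str, options: list[str]) -> str:
    """Turn one Medscape Case Challenge into a single standardized prompt."""
    lettered = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", options)
    )
    return (
        "You are answering a clinical case challenge.\n\n"
        f"Case: {case_description}\n\n"
        f"Answer options:\n{lettered}\n\n"
        # Scripting the expected output keeps responses comparable
        # across all 150 cases, which is the point of standardizing.
        "Respond in exactly this format:\n"
        "Answer: <letter of the single best option>\n"
        "Rationale: <brief explanation of your reasoning>"
    )

# Hypothetical usage, loosely based on one of the case titles above
# (the answer options here are invented for illustration):
print(build_prompt(
    "A 33-year-old man presents unable to swallow his own saliva...",
    ["Achalasia", "Esophageal stricture",
     "Eosinophilic esophagitis", "Diffuse esophageal spasm"],
))
```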

Out of the 150 Medscape cases analyzed, ChatGPT provided the correct answer in 49% of cases. However, the chatbot demonstrated an overall accuracy of 74%, a figure that also credits its ability to identify and reject the incorrect multiple-choice options.

“This higher value is due to the ChatGPT’s ability to identify true negatives (incorrect options), which significantly contributes to the overall accuracy, enhancing its utility in eliminating incorrect choices,” the researchers explain. “This difference highlights ChatGPT’s high specificity, indicating its ability to excel at ruling out incorrect diagnoses. However, it needs improvement in precision and sensitivity to reliably identify the correct diagnosis.”
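
The arithmetic behind those two figures can be reconstructed in a few lines. In the Python sketch below, the per-case counts are inferred from the reported percentages rather than taken from the paper: picking one of four options per case means every case yields four option-level judgments, and the distractors ChatGPT correctly rules out are what lift the overall figure to roughly 74%:

```python
# Reconstructing the option-level "overall accuracy" from the case-level
# result. Case counts are assumptions inferred from the reported
# percentages (150 cases, ~49% answered correctly).

CASES = 150
OPTIONS_PER_CASE = 4
CORRECT_CASES = 73                    # 73/150 ≈ 49% of cases correct
WRONG_CASES = CASES - CORRECT_CASES   # 77 cases answered incorrectly

# Scoring all four options per case: a correct pick yields 1 true
# positive and 3 true negatives; a wrong pick yields 1 false positive
# (the chosen distractor), 1 false negative (the missed true answer),
# and 2 true negatives.
tp = CORRECT_CASES
tn = CORRECT_CASES * 3 + WRONG_CASES * 2
fp = WRONG_CASES
fn = WRONG_CASES

total = CASES * OPTIONS_PER_CASE              # 600 option-level judgments
print(f"overall accuracy: {(tp + tn) / total:.0%}")   # ~74%
print(f"false positive rate: {fp / total:.0%}")       # ~13%
print(f"false negative rate: {fn / total:.0%}")       # ~13%
```

The same bookkeeping also reproduces the 13% false-positive and false-negative rates mentioned next.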

In addition, ChatGPT produced false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool. A little over half (52%) of the answers provided were complete and relevant, with 43% incomplete but still relevant. ChatGPT tended to produce answers with a low (51%) to moderate (41%) cognitive load, making them easy for users to understand. However, the researchers point out that this ease of understanding, combined with the potential for incorrect or irrelevant information, could result in “misconceptions and a false sense of comprehension”, particularly if ChatGPT is being used as a medical education tool.

“ChatGPT also struggled to distinguish between diseases with subtly different presentations and the model also occasionally generated incorrect or implausible information, known as AI hallucinations, emphasizing the risk of sole reliance on ChatGPT for medical guidance and the necessity of human expertise in the diagnostic process,” said the researchers.

The researchers say that AI should be used as a tool to enhance, not replace, medicine's human element

Of course, as the researchers acknowledge among the study’s limitations, ChatGPT 3.5 is only one AI model, may not be representative of other models, and is bound to improve in future iterations, which could raise its accuracy. Also, the Medscape cases analyzed primarily focused on differential diagnosis, where medical professionals must differentiate between two or more conditions with similar signs or symptoms.

While future research should assess the accuracy of different AI models using a wider range of case sources, the results of the present study are nonetheless instructive.

“The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may be misleading,” the researchers said. “While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial inter-rater reliability, it also reveals the tool’s shortcomings in providing factually correct medical information, as evident [sic] by its low diagnostic accuracy.”

The study was published in the journal PLOS One.

7 comments
Alan
50% doesn't sound much worse than what the average MD would score.

I also wonder what might happen if other AIs were also compared.
Daishi
I understand why they chose GPT 3.5 for the test. At the time, that was the model running the free version of ChatGPT, and newer models with more recent training data would also likely have seen the more recent test pool questions. I'm sure new models deliver a minor improvement over time, but the concerns are valid. Sometimes a well-informed incorrect answer that seems correct is actually worse than just being uninformed and not answering at all.
Jonathan
So let's get this straight: you use an outdated version of an LLM that is not trained on medical data to diagnose medical conditions, and you conclude that AI as a whole is bad at medical diagnosis?
Peter
"The researchers say their findings show that AI shouldn’t be the sole source of medical information"

Not true. They looked at ChatGPT and in their conclusion "advise[d] against relying on ChatGPT for medical counsel". They didn't generalize it to all AI.

While I agree with the article's message (the internet/Google/ChatGPT is no substitute for medical experts), don't put words in the mouths of the researchers.
Drjohnf
Alan.....most MDs would score nearly 100% on the boards. Besides, being an MD is much, much more than just passing board exams, which are the absolute easiest task any MD ever needs to accomplish in their career. ChatGPT is just an LLM at most. It's not really "artificially intelligent". It's just a statistical model that uses large data sets. No intelligence is ever used in its ministrations because it is not in the least bit "intelligent". Even if an LLM was trained on all of the world's "medical literature", there would be significant bias now because of all of the terribly biased, irresponsible and outright incorrect medical literature put out by the various pharmaceutical companies that have bought most of the medical journals and which forbid the publication of medical literature that is contrary to their business interests. Far better for an LLM to be trained on reliable medical literature such as the Cochrane database or the UpToDate articles, etc.
christopher
Where are the Medscape statistics about human efficacy? It is beyond fishy that this baseline data is missing - its lack renders all this research entirely meaningless.

I wonder if MDs score so badly all the time that they're unable to even tell us those stats?

I can, with certainty, say that modern AI as of now is 100% going to make mistakes ranging from smallish to catastrophic on basically every question relating to computer coding (having used it all day, every day for a few months in a row now). I cannot remember even one time I got any question right first try. I can also say with the same certainty that staff I employ suffer the same problems, if not worse (their work is much less scrutinised!).

I sympathise with the authors on the "cheating" problem. AIs are the biggest cheats possible - literally having fed on the answers to everything for their existence. It's impossible to gauge their usefulness by asking them any question they've seen before, and humans are lazy: making up new questions is hard.
Alan
Christopher wrote "I can also say with the same certainty that staff I employ suffer the same problems, if not worse (their work is much less scrutinised!)."
----
Which is one reason why we have so many bugs in computer code and why programs/apps have to be updated continuously. IMO, LLMs, AIs or whatever you want to call them will eventually produce almost 100% trouble-free code on the first try, and our lives will be much improved.