ChatGPT is mediocre at diagnosing medical conditions, getting it right only 49% of the time, according to a new study. The researchers say their findings show that AI shouldn’t be the sole source of medical information and highlight the importance of maintaining the human element in healthcare.
The convenience of online technology means that some people bypass seeing a medical professional altogether, choosing to google their symptoms instead. While being proactive about one’s health is not a bad thing, ‘Dr Google’ is just not that accurate. A 2020 Australian study looking at 36 international mobile and web-based symptom checkers found that a correct diagnosis was listed first only 36% of the time.
Surely, AI has improved since 2020. Yes, it definitely has. OpenAI’s ChatGPT has progressed in leaps and bounds – it’s able to pass the US Medical Licensing Exam, after all. But does that make it better than Dr Google in terms of diagnostic accuracy? That’s the question that researchers from Western University in Canada sought to answer in a new study.
Using ChatGPT 3.5, a large language model (LLM) trained on a massive dataset of over 400 billion words drawn from internet sources including books, articles, and websites, the researchers conducted a qualitative analysis of the medical information the chatbot provided by having it answer Medscape Case Challenges.
Medscape Case Challenges are complex clinical cases that challenge a medical professional’s knowledge and diagnostic skills. Medical professionals are required to make a diagnosis or choose an appropriate treatment plan for a case by selecting from four multiple-choice answers. The researchers chose Medscape’s Case Challenges because they’re open-source and freely accessible. To prevent the possibility that ChatGPT had prior knowledge of the cases, only cases authored after model 3.5’s August 2021 training cutoff were included.
A total of 150 Medscape cases were analyzed. With four multiple-choice responses per case, that meant there were 600 possible answers in total, with only one correct answer per case. The analyzed cases covered a wide range of medical problems, with titles like "Beer, Aspirin Worsen Nasal Issues in a 35-Year-Old With Asthma", "Gastro Case Challenge: A 33-Year-Old Man Who Can’t Swallow His Own Saliva", "A 27-Year-Old Woman With Constant Headache Too Tired To Party", "Pediatric Case Challenge: A 7-Year-Old Boy With a Limp and Obesity Who Fell in the Street", and "An Accountant Who Loves Aerobics With Hiccups and Incoordination". Cases with visual assets, like clinical images, medical photography, and graphs, were excluded.
To ensure consistency in the input provided to ChatGPT, each case challenge was turned into a single standardized prompt, including a script for the output the chatbot was to provide. All cases were evaluated by at least two independent raters (medical trainees) who were blinded to each other’s responses. They assessed ChatGPT’s responses for diagnostic accuracy, cognitive load (that is, the complexity and clarity of the information provided, from low to high), and quality of medical information (including whether it was complete and relevant).
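The article doesn’t reproduce the researchers’ actual prompt script, but a minimal sketch of the general idea, assuming a simple template that bundles a case description and its four answer choices into one message, might look something like this (the wording, field names, and placeholder options are hypothetical, not the study’s prompt):

```python
# A minimal, hypothetical sketch of a standardized prompt for one
# multiple-choice case challenge. The wording and the demo inputs below are
# illustrative assumptions, not the researchers' actual script.

def build_prompt(case_text: str, options: list[str]) -> str:
    """Bundle a case description and its four answer choices into one prompt."""
    option_lines = "\n".join(
        f"{letter}. {option}" for letter, option in zip("ABCD", options)
    )
    return (
        "You are answering a clinical case challenge.\n\n"
        f"Case description:\n{case_text}\n\n"
        f"Answer choices:\n{option_lines}\n\n"
        "Reply with the single best answer (A, B, C, or D), followed by a "
        "brief rationale for accepting it and for rejecting the other options."
    )

if __name__ == "__main__":
    demo_case = "A 33-year-old man presents because he cannot swallow his own saliva..."
    demo_options = ["Diagnosis A", "Diagnosis B", "Diagnosis C", "Diagnosis D"]
    print(build_prompt(demo_case, demo_options))
```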
Out of the 150 Medscape cases analyzed, ChatGPT provided the correct answer in 49% of cases. However, when every multiple-choice option was scored, the chatbot demonstrated an overall accuracy of 74%, reflecting its ability to identify and reject incorrect options as well.
“This higher value is due to the ChatGPT’s ability to identify true negatives (incorrect options), which significantly contributes to the overall accuracy, enhancing its utility in eliminating incorrect choices,” the researchers explain. “This difference highlights ChatGPT’s high specificity, indicating its ability to excel at ruling out incorrect diagnoses. However, it needs improvement in precision and sensitivity to reliably identify the correct diagnosis.”
In addition, ChatGPT provided false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool. A little over half (52%) of the answers provided were complete and relevant, with 43% incomplete but still relevant. ChatGPT tended to produce answers with a low (51%) to moderate (41%) cognitive load, making them easy to understand for users. However, the researchers point out that this ease of understanding, combined with the potential for incorrect or irrelevant information, could result in “misconceptions and a false sense of comprehension”, particularly if ChatGPT is being used as a medical education tool.
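To unpack those terms, the sketch below shows how scoring every option per case (rather than only the chosen diagnosis) can yield a higher overall accuracy than per-case diagnostic accuracy. The counts are invented purely to make the arithmetic concrete; they are not the study’s reported confusion matrix.

```python
# Illustrative only: how per-option scoring can produce a higher "overall
# accuracy" than per-case diagnostic accuracy. All counts below are invented
# to make the arithmetic concrete; they are not the study's actual data.

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all scored options handled correctly
        "sensitivity": tp / (tp + fn),                # correct diagnoses actually picked
        "specificity": tn / (tn + fp),                # wrong options correctly rejected
        "precision": tp / (tp + fp),                  # picks that turned out to be correct
    }

if __name__ == "__main__":
    # Hypothetical breakdown: 150 cases x 4 options = 600 scored options.
    # Suppose the model picks the right option in 74 cases (TP = 74); in the
    # other 76 cases it picks a wrong option (FP = 76) and misses the correct
    # one (FN = 76); every remaining wrong option is correctly rejected
    # (TN = 600 - 74 - 76 - 76 = 374).
    for name, value in metrics(tp=74, fp=76, tn=374, fn=76).items():
        print(f"{name}: {value:.2f}")
```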
“ChatGPT also struggled to distinguish between diseases with subtly different presentations and the model also occasionally generated incorrect or implausible information, known as AI hallucinations, emphasizing the risk of sole reliance on ChatGPT for medical guidance and the necessity of human expertise in the diagnostic process,” said the researchers.
Of course – and the researchers point this out as a limitation of the study – ChatGPT 3.5 is only one AI model, may not be representative of other models, and is bound to improve in future iterations. Also, the Medscape cases analyzed by ChatGPT primarily focused on differential diagnosis, where medical professionals must distinguish between two or more conditions with similar signs or symptoms.
While future research should assess the accuracy of different AI models using a wider range of case sources, the results of the present study are nonetheless instructive.
“The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may be misleading,” the researchers said. “While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial inter-rater reliability, it also reveals the tool’s shortcomings in providing factually correct medical information, as evident [sic] by its low diagnostic accuracy.”
The study was published in the journal PLOS One.
I also wonder what might happen if other AIs were compared.
Not true. They looked at ChatGPT and in their conclusion "advise[d] against relying on ChatGPT for medical counsel". They didn't generalize it to all AI.
While I agree with the article's message -- the internet/Google/ChatGPT is no substitute for medical experts -- don't put words in the mouths of the researchers.
I wonder if MDs score so badly all the time that they're unable to even tell us those stats?
I can say with certainty that modern AI, as of now, is 100% going to make mistakes ranging from smallish to catastrophic on basically every question relating to computer coding (having used it all day, every day for a few months in a row now). I cannot remember even one time I got a question right on the first try. I can also say with the same certainty that staff I employ suffer the same problems, if not worse (their work is much less scrutinised!).
I sympathise with the authors on the "cheating" problem. AIs are the biggest cheats possible -- they have literally been fed the answers to everything for their entire existence. It's impossible to gauge their usefulness by asking them any question they've seen before, and humans are lazy: making up new questions is hard.
----
Which is one reason why we have so many bugs in computer code and why programs/apps have to be updated continuously. IMO, LLMs, AIs, or whatever you want to call them will eventually produce almost 100% trouble-free code on the first try, and our lives will be much improved.