
Top AI models fail spectacularly when faced with slightly altered medical questions

by Eric W. Dolan
August 24, 2025
in Artificial Intelligence

Artificial intelligence systems often perform impressively on standardized medical exams—but new research suggests these test scores may be misleading. A study published in JAMA Network Open indicates that large language models, or LLMs, might not actually “reason” through clinical questions. Instead, they seem to rely heavily on recognizing familiar answer patterns. When those patterns were slightly altered, the models’ performance dropped significantly—sometimes by more than half.

Large language models are a type of artificial intelligence system trained to process and generate human-like language. They are built using vast datasets that include books, scientific papers, web pages, and other text sources. By analyzing patterns in this data, these models learn how to respond to questions, summarize information, and even simulate reasoning. In recent years, several models have achieved high scores on medical exams, sparking interest in using them to support clinical decision-making.

But high test scores do not necessarily indicate an understanding of the underlying content. Instead, many of these models may simply be predicting the most likely answer based on statistical patterns. This raises the question: are they truly reasoning about medical scenarios, or just mimicking answers they’ve seen before? That’s what the researchers behind the new study set out to examine.

“I am particularly excited about bridging the gap between model building and model deployment and the right evaluation is key to that,” explained study author Suhana Bedi, a PhD student at Stanford University.

“We have AI models achieving near perfect accuracy on benchmarks like multiple choice based medical licensing exam questions. But this doesn’t reflect the reality of clinical practice. We found that less than 5% of papers evaluate LLMs on real patient data which can be messy and fragmented.”

“So, we released a benchmark suite of 35 benchmarks mapped to a taxonomy of real medical and healthcare tasks that were verified by 30 clinicians. We found that most models (including reasoning models) struggled on Administrative and Clinical Decision Support tasks.”

“We hypothesized that this was because these tasks involved complex reasoning scenarios that couldn’t be solved through pattern matching alone, exactly the kind of clinical thinking that matters in real practice,” Bedi explained. “With everyone talking about deploying AI in hospitals, we thought this was a very important question to answer.”

To investigate this, the research team created a modified version of the MedQA benchmark. They selected 100 multiple-choice questions from the original test and rewrote a subset of them to replace the correct answer with “None of the other answers,” or NOTA. This subtle shift forced the models to rely on actual medical reasoning rather than simply recognizing previously seen answer formats. A practicing clinician reviewed all changes to ensure the new “None of the other answers” response was medically appropriate.


Sixty-eight of the questions met the criteria for this test set. Each question presented a clinical scenario and asked for the most appropriate next step in treatment or diagnosis. One example involved a newborn with an inward-turning foot—a typical case of metatarsus adductus, which usually resolves on its own. In the original version, “Reassurance” was the correct answer. In the modified version, “Reassurance” was removed and replaced with “None of the other answers,” making the task more challenging.
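The paper’s exact pipeline is not reproduced in the article, but the substitution itself is easy to picture. Below is a minimal Python sketch, with made-up field names, of how an item like the one above could be rewritten so that the familiar correct option vanishes and “None of the other answers” becomes the right choice. The clinician review described above is a separate, human step that no code replaces.

```python
# Illustrative sketch only (not the authors' released code): rewriting a
# MedQA-style item so the original correct option's text is swapped for
# "None of the other answers" (NOTA). Field names are assumptions.

NOTA = "None of the other answers"

def to_nota_variant(question: dict) -> dict:
    """Return a copy of a multiple-choice item in which the correct
    option's text is replaced by NOTA, making NOTA the right answer."""
    options = dict(question["options"])   # e.g. {"A": "Reassurance", ...}
    answer_key = question["answer"]       # e.g. "A"
    options[answer_key] = NOTA            # remove the familiar correct text
    return {**question, "options": options, "answer": answer_key}

# Example loosely resembling the newborn-foot item described above
item = {
    "stem": "A newborn has an inward-turning foot consistent with metatarsus "
            "adductus. What is the most appropriate next step in management?",
    "options": {"A": "Reassurance", "B": "Serial casting",
                "C": "Surgical correction", "D": "Rigid orthotics"},
    "answer": "A",
}

nota_item = to_nota_variant(item)
print(nota_item["options"]["A"])  # -> "None of the other answers"
```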

Bedi and her colleagues then evaluated six widely used artificial intelligence models, including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, and others. All models were prompted to reason through each question using a method called chain-of-thought, which encourages step-by-step explanations of their answers. This approach is intended to support more deliberate reasoning rather than simple guesswork.
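The article does not quote the prompts themselves, so the following is only an illustrative sketch of a chain-of-thought instruction of the kind described, built from a multiple-choice item like the one above; the wording is an assumption, not the study’s actual prompt.

```python
# Hypothetical chain-of-thought prompt construction (wording is assumed).
def build_cot_prompt(stem: str, options: dict) -> str:
    """Ask the model to reason step by step before committing to an option."""
    option_lines = [f"{key}. {text}" for key, text in sorted(options.items())]
    return (
        f"{stem}\n\n"
        + "\n".join(option_lines)
        + "\n\nThink through the clinical scenario step by step, "
          "then give the single best option letter."
    )

print(build_cot_prompt(
    "A newborn has an inward-turning foot consistent with metatarsus adductus. "
    "What is the most appropriate next step in management?",
    {"A": "None of the other answers", "B": "Serial casting",
     "C": "Surgical correction", "D": "Rigid orthotics"},
))
```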

The models were tested on both the original and modified questions, and the researchers compared their performance across these two conditions. They used statistical methods to measure the significance of any accuracy drops, with a focus on whether each model could maintain performance when familiar patterns were removed.
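The article does not name the specific statistical test, so the sketch below shows one reasonable way to judge whether a drop on the same 68 questions exceeds chance, using a paired bootstrap over questions; this is offered only as an illustration, not as the authors’ method.

```python
# Assumed approach for illustration: paired bootstrap over the shared questions
# to put an interval around the accuracy drop between original and NOTA forms.
import random

def bootstrap_accuracy_drop(orig_correct, nota_correct, n_boot=10_000, seed=0):
    """orig_correct / nota_correct: lists of 0/1 per question, same order.
    Returns the observed drop and a 95% interval for it."""
    rng = random.Random(seed)
    n = len(orig_correct)
    observed = sum(orig_correct) / n - sum(nota_correct) / n
    drops = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions with replacement
        drops.append(sum(orig_correct[i] for i in idx) / n
                     - sum(nota_correct[i] for i in idx) / n)
    drops.sort()
    return observed, (drops[int(0.025 * n_boot)], drops[int(0.975 * n_boot)])
```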

The results suggest that none of the models passed this test unscathed. All six experienced a noticeable decline in accuracy when presented with the NOTA-modified questions. Some models, like DeepSeek-R1 and o3-mini, were more resilient than others, showing drops of around 9 to 16 percent.

But the more dramatic declines were seen in widely used models such as GPT-4o and Claude 3.5 Sonnet, which showed reductions of over 25 percent and 33 percent, respectively. Llama 3.3-70B had the largest drop in performance, answering nearly 40 percent more questions incorrectly when the correct answer was replaced with “None of the other answers.”

“What surprised us most was the consistency of the performance decline across all models, including the most advanced reasoning models like DeepSeek-R1 and o3-mini,” Bedi told PsyPost.

These findings suggest that current AI models tend to rely on recognizing common patterns in test formats, rather than reasoning through complex medical decisions. When familiar options are removed or altered, performance deteriorates, sometimes dramatically.

The researchers interpret this pattern as evidence that many AI systems may not be equipped to handle novel clinical situations—at least not yet. In real-world medicine, patients often present with overlapping symptoms, incomplete histories, or unexpected complications. If an AI system cannot handle minor shifts in question formatting, it may also struggle with this kind of real-life variability.

“These AI models aren’t as reliable as their test scores suggest,” Bedi said. “When we changed the answer choices slightly, performance dropped dramatically, with some models going from 80% accuracy down to 42%. It’s like having a student who aces practice tests but fails when the questions are worded differently. For now, AI should help doctors, not replace them.”

While the study was relatively small, limited to 68 test questions, the consistency of the performance decline across all six models raised concern. The authors acknowledge that more research is needed, including testing larger and more diverse datasets and evaluating models using different methods, such as retrieval-augmented generation or fine-tuning on clinical data.

“We only tested 68 questions from one medical exam, so this isn’t the full picture of AI capabilities,” Bedi noted. “Also, we used a specific way to test reasoning, there might be other approaches that reveal different strengths or weaknesses. Real clinical deployment would likely involve more sophisticated setups than what we tested.”

Still, the authors suggest their results point to three major priorities moving forward: building evaluation tools that separate true reasoning from pattern recognition, improving transparency around how current systems handle novel medical problems, and developing new models that prioritize reasoning abilities.

“We want to build better tests that can tell the difference between AI systems that reason versus those that just memorize patterns,” Bedi said. “We’re also hoping this work pushes the field toward developing AI that’s more genuinely reliable for medical use, not just good at taking tests.”

“The main thing is that impressive test scores don’t automatically mean an AI system is ready for the real world. Medicine is complicated and unpredictable, and we need AI systems that can handle that complexity safely. This research is about making sure we get there responsibly.”

The study, “Fidelity of Medical Reasoning in Large Language Models,” was authored by Suhana Bedi, Yixing Jiang, Philip Chung, Sanmi Koyejo, and Nigam Shah.
