
Top AI models fail spectacularly when faced with slightly altered medical questions

by Eric W. Dolan
August 24, 2025
in Artificial Intelligence
[Adobe Stock]

Artificial intelligence systems often perform impressively on standardized medical exams—but new research suggests these test scores may be misleading. A study published in JAMA Network Open indicates that large language models, or LLMs, might not actually “reason” through clinical questions. Instead, they seem to rely heavily on recognizing familiar answer patterns. When those patterns were slightly altered, the models’ performance dropped significantly—sometimes by more than half.

Large language models are a type of artificial intelligence system trained to process and generate human-like language. They are built using vast datasets that include books, scientific papers, web pages, and other text sources. By analyzing patterns in this data, these models learn how to respond to questions, summarize information, and even simulate reasoning. In recent years, several models have achieved high scores on medical exams, sparking interest in using them to support clinical decision-making.

But high test scores do not necessarily indicate an understanding of the underlying content. Instead, many of these models may simply be predicting the most likely answer based on statistical patterns. This raises the question: are they truly reasoning about medical scenarios, or just mimicking answers they’ve seen before? That’s what the researchers behind the new study set out to examine.

“I am particularly excited about bridging the gap between model building and model deployment and the right evaluation is key to that,” explained study author Suhana Bedi, a PhD student at Stanford University.

“We have AI models achieving near perfect accuracy on benchmarks like multiple choice based medical licensing exam questions. But this doesn’t reflect the reality of clinical practice. We found that less than 5% of papers evaluate LLMs on real patient data which can be messy and fragmented.”

“So, we released a benchmark suite of 35 benchmarks mapped to a taxonomy of real medical and healthcare tasks that were verified by 30 clinicians. We found that most models (including reasoning models) struggled on Administrative and Clinical Decision Support tasks.”

“We hypothesized that this was because these tasks involved complex reasoning scenarios that couldn’t be solved through pattern matching alone, exactly the kind of clinical thinking that matters in real practice,” Bedi explained. “With everyone talking about deploying AI in hospitals, we thought this was a very important question to answer.”

To investigate this, the research team created a modified version of the MedQA benchmark. They selected 100 multiple-choice questions from the original test and rewrote a subset of them to replace the correct answer with “None of the other answers,” or NOTA. This subtle shift forced the models to rely on actual medical reasoning rather than simply recognizing previously seen answer formats. A practicing clinician reviewed all changes to ensure the new “None of the other answers” response was medically appropriate.


Sixty-eight of the questions met the criteria for this test set. Each question presented a clinical scenario and asked for the most appropriate next step in treatment or diagnosis. One example involved a newborn with an inward-turning foot—a typical case of metatarsus adductus, which usually resolves on its own. In the original version, “Reassurance” was the correct answer. In the modified version, “Reassurance” was removed and replaced with “None of the other answers,” making the task more challenging.
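The substitution the researchers describe can be sketched as a simple transformation over a question record. The schema and field names below are assumptions for illustration, not the authors' actual code or the real MedQA format.

```python
# Hypothetical sketch of the "None of the other answers" (NOTA) substitution
# described above. Question schema and field names are illustrative assumptions.

NOTA = "None of the other answers"

def apply_nota(question: dict) -> dict:
    """Overwrite the correct option's text with NOTA, so a model can no
    longer match a familiar answer string and must reason its way to NOTA."""
    modified = dict(question)
    options = dict(question["options"])
    options[question["answer"]] = NOTA  # replace the correct option's text
    modified["options"] = options
    return modified

# Example mirroring the metatarsus adductus question from the article.
original = {
    "stem": "A newborn presents with an inward-turning foot ...",
    "options": {"A": "Reassurance", "B": "Casting",
                "C": "Surgical correction", "D": "Bracing"},
    "answer": "A",  # letter of the originally correct option
}
modified = apply_nota(original)
print(modified["options"]["A"])  # -> None of the other answers
```

Note that the correct letter stays the same; only the answer text changes, which is what removes the familiar pattern while keeping the question medically answerable.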

Bedi and her colleagues then evaluated six widely used artificial intelligence models, including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, and others. All models were prompted to reason through each question using a method called chain-of-thought, which encourages step-by-step explanations of their answers. This approach is intended to support more deliberate reasoning rather than simple guesswork.
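A chain-of-thought setup of the kind described here typically wraps each question in a prompt that asks for step-by-step reasoning before a final choice. The template below is a minimal sketch under that assumption; the study's actual prompt wording is not reproduced in the article.

```python
# Minimal sketch of a chain-of-thought prompt for a multiple-choice clinical
# question. The wording and format are assumptions, not the study's prompts.

COT_TEMPLATE = (
    "You are answering a medical licensing exam question.\n"
    "{stem}\n\n"
    "Options:\n{options}\n\n"
    "Think through the clinical reasoning step by step, "
    "then give the single best option letter on the last line."
)

def build_prompt(stem: str, options: dict) -> str:
    # Render options as "A. text" lines in letter order.
    lines = "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
    return COT_TEMPLATE.format(stem=stem, options=lines)

prompt = build_prompt(
    "A newborn presents with an inward-turning foot ...",
    {"A": "None of the other answers", "B": "Casting",
     "C": "Surgical correction", "D": "Bracing"},
)
print(prompt)
```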

The models were tested on both the original and modified questions, and the researchers compared their performance across these two conditions. They used statistical methods to measure the significance of any accuracy drops, with a focus on whether each model could maintain performance when familiar patterns were removed.
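The core comparison amounts to computing each model's accuracy under both conditions and measuring the decline. This is a sketch of that bookkeeping only; the article does not specify which statistical test the authors used, and the example figures are the illustrative 80% and 42% Bedi cites.

```python
# Sketch of the per-model comparison across the original and NOTA conditions.
# Helper names are assumptions; the significance test itself is omitted
# because the article does not specify it.

def accuracy(predictions: list, answers: list) -> float:
    """Fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def drop_in_points(acc_original: float, acc_nota: float) -> float:
    """Decline in percentage points between the two conditions."""
    return round((acc_original - acc_nota) * 100, 1)

# Illustrative figures from the article: one model fell from 80% to 42%.
print(drop_in_points(0.80, 0.42))  # -> 38.0
```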

The results suggest that none of the models passed this test unscathed. All six experienced a noticeable decline in accuracy when presented with the NOTA-modified questions. Some models, like DeepSeek-R1 and o3-mini, were more resilient than others, showing drops of around 9 to 16 percent.

But the more dramatic declines were seen in widely used models such as GPT-4o and Claude 3.5 Sonnet, which showed reductions of over 25 percent and 33 percent, respectively. Llama 3.3-70B had the largest drop in performance, answering nearly 40 percent more questions incorrectly when the correct answer was replaced with “None of the other answers.”

“What surprised us most was the consistency of the performance decline across all models, including the most advanced reasoning models like DeepSeek-R1 and o3-mini,” Bedi told PsyPost.

These findings suggest that current AI models tend to rely on recognizing common patterns in test formats, rather than reasoning through complex medical decisions. When familiar options are removed or altered, performance deteriorates, sometimes dramatically.

The researchers interpret this pattern as evidence that many AI systems may not be equipped to handle novel clinical situations—at least not yet. In real-world medicine, patients often present with overlapping symptoms, incomplete histories, or unexpected complications. If an AI system cannot handle minor shifts in question formatting, it may also struggle with these kinds of real-life variability.

“These AI models aren’t as reliable as their test scores suggest,” Bedi said. “When we changed the answer choices slightly, performance dropped dramatically, with some models going from 80% accuracy down to 42%. It’s like having a student who aces practice tests but fails when the questions are worded differently. For now, AI should help doctors, not replace them.”

While the study was relatively small, limited to 68 test questions, the consistency of the performance decline across all six models raised concern. The authors acknowledge that more research is needed, including testing larger and more diverse datasets and evaluating models using different methods, such as retrieval-augmented generation or fine-tuning on clinical data.

“We only tested 68 questions from one medical exam, so this isn’t the full picture of AI capabilities,” Bedi noted. “Also, we used a specific way to test reasoning, there might be other approaches that reveal different strengths or weaknesses. Real clinical deployment would likely involve more sophisticated setups than what we tested.”

Still, the authors suggest their results point to three major priorities moving forward: building evaluation tools that separate true reasoning from pattern recognition, improving transparency around how current systems handle novel medical problems, and developing new models that prioritize reasoning abilities.

“We want to build better tests that can tell the difference between AI systems that reason versus those that just memorize patterns,” Bedi said. “We’re also hoping this work pushes the field toward developing AI that’s more genuinely reliable for medical use, not just good at taking tests.”

“The main thing is that impressive test scores don’t automatically mean an AI system is ready for the real world. Medicine is complicated and unpredictable, and we need AI systems that can handle that complexity safely. This research is about making sure we get there responsibly.”

The study, “Fidelity of Medical Reasoning in Large Language Models,” was authored by Suhana Bedi, Yixing Jiang, Philip Chung, Sanmi Koyejo, and Nigam Shah.
