AI chatbots often misrepresent scientific studies — and newer models may be worse

by Eric W. Dolan
May 20, 2025
in Artificial Intelligence

Artificial intelligence chatbots are becoming popular tools for summarizing scientific research, but a new study suggests these systems often misrepresent the findings they summarize. Published in Royal Society Open Science, the study found that the most widely used language models frequently overgeneralize the results of scientific studies—sometimes making broader or more confident claims than the original research supports. This tendency was more common in newer models and, paradoxically, was worsened when the chatbots were explicitly asked to be accurate.

The study was led by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University and the University of Cambridge. The researchers were motivated by growing concerns about the use of large language models—such as ChatGPT, Claude, DeepSeek, and LLaMA—in science communication.

These tools are often praised for their ability to summarize complex material, but critics have warned that they may overlook important limitations or caveats, especially when converting technical findings into more readable language. Overgeneralizations can mislead readers, particularly when scientific results are treated as universally applicable or when uncertain findings are reframed as policy recommendations.

To test whether these concerns were justified, the researchers conducted a large-scale evaluation of 10 of the most prominent large language models. These included popular systems like GPT-4 Turbo, ChatGPT-4o, Claude 3.7 Sonnet, and DeepSeek. In total, they analyzed 4,900 chatbot-generated summaries of scientific texts.

The source material included 200 research abstracts from top science and medical journals such as Nature, Science, The Lancet, and The New England Journal of Medicine, as well as 100 full-length medical articles. For some of the full articles, the researchers also included expert-written summaries from NEJM Journal Watch to allow for comparisons between human- and AI-generated summaries.

Each summary was examined for signs of overgeneralization. The researchers focused on three specific features that broaden the scope of scientific claims: using generic statements instead of specific ones, changing past tense descriptions to present tense, and turning descriptive findings into action-oriented recommendations. For example, if a study stated that “participants in this trial experienced improvements,” a generalized version might say “this treatment improves outcomes,” which could falsely suggest a broader or more universal effect.
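
To make these three markers concrete, the toy Python sketch below flags two of them with simple keyword checks (detecting a tense shift would require comparing a summary against its source text, which this sketch does not do). The phrase lists and function name are hypothetical; the study relied on systematic annotation of each summary against the original, not keyword matching.

```python
import re

# Hypothetical phrase lists for illustration only; the study annotated each
# summary against its source text rather than matching keywords.
ACTION_PHRASES = ("should", "is recommended", "ought to be")
SCOPE_QUALIFIERS = ("in this trial", "in this study", "in this sample", "among participants")
GENERIC_EFFECT_VERBS = re.compile(r"\b(improves|reduces|increases|prevents|causes)\b")

def flag_overgeneralization(sentence: str) -> dict:
    """Flag generic present-tense claims and action-oriented recommendations."""
    s = sentence.lower()
    return {
        # Generic claim: a present-tense effect verb with no scope qualifier.
        "generic_present_tense": bool(GENERIC_EFFECT_VERBS.search(s))
        and not any(q in s for q in SCOPE_QUALIFIERS),
        # Recommendation: a descriptive finding reframed as advice.
        "action_recommendation": any(p in s for p in ACTION_PHRASES),
    }

print(flag_overgeneralization("Participants in this trial experienced improvements."))
# {'generic_present_tense': False, 'action_recommendation': False}
print(flag_overgeneralization("This treatment improves outcomes and should be adopted."))
# {'generic_present_tense': True, 'action_recommendation': True}
```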

Most language models produced summaries that were significantly more likely to contain generalized conclusions than the original texts. In fact, summaries from newer models such as ChatGPT-4o and LLaMA 3.3 were up to 73% more likely to include overgeneralizations. By contrast, earlier models such as GPT-3.5 and the Claude family were less likely to introduce these problems.

The researchers also found that prompting the models to be more accurate didn’t help—if anything, it made things worse. When models were instructed to “avoid inaccuracies,” they were nearly twice as likely to produce generalized statements compared to when they were simply asked to summarize the text. One explanation for this counterintuitive result may relate to how the models interpret prompts. Much like the human tendency to fixate on a thought when told not to think about it, the models may respond to reminders about accuracy by producing more authoritative-sounding—but misleading—summaries.
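
For readers who want to probe this effect themselves, the contrast between the two prompt conditions might look something like the sketch below. Only the phrase "avoid inaccuracies" is quoted from the article; the rest of the wording is a hypothetical reconstruction, not the study's actual prompts.

```python
# Two illustrative prompt conditions mirroring the comparison described above.
PROMPT_CONDITIONS = {
    "plain":    "Summarize the following scientific text.",
    "accuracy": "Summarize the following scientific text. Avoid inaccuracies.",
}

def build_prompt(condition: str, source_text: str) -> str:
    """Combine one prompt condition with the abstract or article to be summarized."""
    return f"{PROMPT_CONDITIONS[condition]}\n\n{source_text}"
```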

In addition to comparing chatbot summaries to the original research, the study also looked at how the models performed compared to human science writers. Specifically, the researchers compared model-generated summaries of medical research to expert-written summaries published in NEJM Journal Watch. They found that the human-authored summaries were much less likely to contain overgeneralizations. In fact, the chatbot-generated summaries were nearly five times more likely to broaden the scope of scientific conclusions beyond what the original studies supported.

Another interesting finding was the role of model settings. When the researchers generated summaries through an API with the temperature parameter set to 0 (a value that makes the model's output more deterministic and less creative), the likelihood of overgeneralization dropped significantly. This suggests that controlling certain technical parameters can help reduce errors, though this option may not be available to everyday users accessing chatbots through standard web interfaces.
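
For users with API access, this setting is typically a single request parameter. The sketch below uses the OpenAI Python client as one example; the model name and prompt are illustrative, an OPENAI_API_KEY environment variable is assumed, and this is not a reproduction of the study's actual pipeline.

```python
# Minimal sketch with the OpenAI Python client (pip install openai); assumes
# the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()

abstract = "<paste the research abstract here>"

response = client.chat.completions.create(
    model="gpt-4o",          # illustrative model name
    temperature=0,           # the deterministic setting discussed above
    messages=[
        {"role": "user", "content": f"Summarize the following abstract:\n\n{abstract}"},
    ],
)

print(response.choices[0].message.content)
```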

The researchers point out that not all generalizations are inherently bad. Sometimes, simplifying complex findings can make science more accessible, especially for non-experts. But when these generalizations go beyond the evidence, they can mislead readers and even pose risks. This is particularly concerning in high-stakes fields like medicine, where overstated claims could affect clinical decisions.

While the study focused on overgeneralizations, the researchers acknowledged that undergeneralizations can also occur. A model might turn a broadly supported finding into a narrowly worded summary, potentially downplaying important results. However, these instances were far less common than the overgeneralizations, which were the main focus of the research.

This study stands out not only for its scale and thoroughness but also for offering a clear framework to evaluate how well language models preserve the scope of scientific conclusions. The researchers suggest that developers and users of language models adopt several strategies to reduce the risk of misleading summaries. These include using models with more conservative settings, avoiding prompts that explicitly demand accuracy, and choosing systems like Claude that have shown greater fidelity to the original texts.

But the study has some limitations. It only tested a few prompt types, and it focused largely on medical research, which may not generalize to all scientific fields. The human-written summaries used for comparison were written by experts for specialist readers and may not reflect the kinds of summaries written for the general public. Future studies might explore how different prompting strategies or model configurations influence performance across a wider range of scientific disciplines.

The study, “Generalization bias in large language model summarization of scientific research,” was authored by Uwe Peters and Benjamin Chin-Yee.
