PsyPost

AI chatbots often misrepresent scientific studies — and newer models may be worse

by Eric W. Dolan
May 20, 2025
in Artificial Intelligence

Artificial intelligence chatbots are becoming popular tools for summarizing scientific research, but a new study suggests these systems often misrepresent the findings they summarize. Published in Royal Society Open Science, the study found that the most widely used language models frequently overgeneralize the results of scientific studies—sometimes making broader or more confident claims than the original research supports. This tendency was more common in newer models and, paradoxically, was worsened when the chatbots were explicitly asked to be accurate.

The study was led by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University and the University of Cambridge. The researchers were motivated by growing concerns about the use of large language models—such as ChatGPT, Claude, DeepSeek, and LLaMA—in science communication.

These tools are often praised for their ability to summarize complex material, but critics have warned that they may overlook important limitations or caveats, especially when converting technical findings into more readable language. Overgeneralizations can mislead readers, particularly when scientific results are treated as universally applicable or when uncertain findings are reframed as policy recommendations.

To test whether these concerns were justified, the researchers conducted a large-scale evaluation of 10 of the most prominent large language models. These included popular systems like GPT-4 Turbo, ChatGPT-4o, Claude 3.7 Sonnet, and DeepSeek. In total, they analyzed 4,900 chatbot-generated summaries of scientific texts.

The source material included 200 research abstracts from top science and medical journals such as Nature, Science, The Lancet, and The New England Journal of Medicine, as well as 100 full-length medical articles. For some of the full articles, the researchers also included expert-written summaries from NEJM Journal Watch to allow for comparisons between human- and AI-generated summaries.

Each summary was examined for signs of overgeneralization. The researchers focused on three specific features that broaden the scope of scientific claims: using generic statements instead of specific ones, changing past tense descriptions to present tense, and turning descriptive findings into action-oriented recommendations. For example, if a study stated that “participants in this trial experienced improvements,” a generalized version might say “this treatment improves outcomes,” which could falsely suggest a broader or more universal effect.
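The three features described above can be illustrated with a toy check. The following is only a rough sketch, not the researchers' coding procedure: the cue lists and the function names `scope_flags` and `broadened` are hypothetical, chosen for this example, and simple keyword matching like this would miss many real cases.

```python
import re
from typing import Dict, List

# Hypothetical cue lists for illustration only; the study's actual
# coding scheme is not reproduced here.
QUANTIFIED_CUES = re.compile(
    r"\b(participants|patients|in this (trial|study|sample)|some|many|most)\b",
    re.IGNORECASE,
)
PAST_TENSE_CUES = re.compile(
    r"\b(was|were|experienced|showed|found|improved|reduced|increased)\b",
    re.IGNORECASE,
)
ACTION_CUES = re.compile(
    r"\b(should|must|ought to|need to|is recommended|are recommended)\b",
    re.IGNORECASE,
)


def scope_flags(sentence: str) -> Dict[str, bool]:
    """Flag the three scope-broadening features in a single sentence."""
    return {
        # No reference to a specific sample or quantity: reads as a generic claim.
        "generic": QUANTIFIED_CUES.search(sentence) is None,
        # No past-tense cue: treated as a present-tense claim.
        "present_tense": PAST_TENSE_CUES.search(sentence) is None,
        # Contains advice-like wording.
        "action_guiding": ACTION_CUES.search(sentence) is not None,
    }


def broadened(original: str, summary: str) -> List[str]:
    """List the features present in the summary but absent from the original."""
    before = scope_flags(original)
    after = scope_flags(summary)
    return [name for name in after if after[name] and not before[name]]


original = "Participants in this trial experienced improvements."
summary = "This treatment improves outcomes."
print(broadened(original, summary))
```

Applied to the article's own example, the sketch flags the summary as both generic and present tense relative to the original sentence.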

Most language models produced summaries that were significantly more likely to contain generalized conclusions than the original texts. In fact, summaries from newer models such as ChatGPT-4o and LLaMA 3.3 were up to 73% more likely to include overgeneralizations. By contrast, the earlier GPT-3.5 and the models in the Claude family were less likely to introduce these problems.

The researchers also found that prompting the models to be more accurate didn’t help—if anything, it made things worse. When models were instructed to “avoid inaccuracies,” they were nearly twice as likely to produce generalized statements compared to when they were simply asked to summarize the text. One explanation for this counterintuitive result may relate to how the models interpret prompts. Much like the human tendency to fixate on a thought when told not to think about it, the models may respond to reminders about accuracy by producing more authoritative-sounding—but misleading—summaries.

In addition to comparing chatbot summaries to the original research, the study also looked at how the models performed compared to human science writers. Specifically, the researchers compared model-generated summaries of medical research to expert-written summaries published in NEJM Journal Watch. They found that the human-authored summaries were much less likely to contain overgeneralizations. In fact, the chatbot-generated summaries were nearly five times more likely to broaden the scope of scientific conclusions beyond what the original studies supported.

Another notable finding concerned model settings. When the researchers generated summaries through an API with the temperature set to 0, a value that makes the model's output more deterministic and less varied, the likelihood of overgeneralization dropped significantly. This suggests that controlling certain technical parameters can help reduce errors, though this option may not be available to everyday users accessing chatbots through standard web interfaces.
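To give a sense of what the temperature setting does, here is a minimal sketch of the standard textbook formulation, in which a model's raw scores for candidate words are divided by the temperature before being turned into probabilities. The scores below are made up for illustration, the function name `softmax_with_temperature` is hypothetical, and treating a temperature of 0 as "always pick the top-scoring word" is an assumption about how that limit is usually handled, not a description of any particular provider's API.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 is treated as greedy decoding,
        # putting all probability on the highest-scoring candidate.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words.
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.1, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

As the temperature falls, probability concentrates on the top-scoring candidate, which is why lower settings yield more repeatable and less varied output.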

The researchers point out that not all generalizations are inherently bad. Sometimes, simplifying complex findings can make science more accessible, especially for non-experts. But when these generalizations go beyond the evidence, they can mislead readers and even pose risks. This is particularly concerning in high-stakes fields like medicine, where overstated claims could affect clinical decisions.

While the study focused on overgeneralizations, the researchers acknowledged that undergeneralizations can also occur. A model might turn a broadly supported finding into a narrowly worded summary, potentially downplaying important results. However, these instances were far less common than the overgeneralizations, which were the main focus of the research.

This study stands out not only for its scale and thoroughness but also for offering a clear framework to evaluate how well language models preserve the scope of scientific conclusions. The researchers suggest that developers and users of language models adopt several strategies to reduce the risk of misleading summaries. These include using more conservative settings such as a lower temperature, avoiding prompts that explicitly demand accuracy, and choosing systems like Claude that have shown greater fidelity to the original texts.

But the study has some limitations. It only tested a few prompt types, and it focused largely on medical research, so its findings may not generalize to all scientific fields. The human-written summaries used for comparison were written for expert audiences and may not reflect the kinds of summaries written for the general public. Future studies might explore how different prompting strategies or model configurations influence performance across a wider range of scientific disciplines.

The study, “Generalization bias in large language model summarization of scientific research,” was authored by Uwe Peters and Benjamin Chin-Yee.

