PsyPost

AI chatbots often misrepresent scientific studies — and newer models may be worse

by Eric W. Dolan
May 20, 2025
in Artificial Intelligence
[Adobe Stock]



Artificial intelligence chatbots are becoming popular tools for summarizing scientific research, but a new study suggests these systems often misrepresent the findings they summarize. Published in Royal Society Open Science, the study found that the most widely used language models frequently overgeneralize the results of scientific studies—sometimes making broader or more confident claims than the original research supports. This tendency was more common in newer models and, paradoxically, was worsened when the chatbots were explicitly asked to be accurate.

The study was led by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University and the University of Cambridge. The researchers were motivated by growing concerns about the use of large language models—such as ChatGPT, Claude, DeepSeek, and LLaMA—in science communication.

These tools are often praised for their ability to summarize complex material, but critics have warned that they may overlook important limitations or caveats, especially when converting technical findings into more readable language. Overgeneralizations can mislead readers, particularly when scientific results are treated as universally applicable or when uncertain findings are reframed as policy recommendations.

To test whether these concerns were justified, the researchers conducted a large-scale evaluation of 10 of the most prominent large language models. These included popular systems like GPT-4 Turbo, ChatGPT-4o, Claude 3.7 Sonnet, and DeepSeek. In total, they analyzed 4,900 chatbot-generated summaries of scientific texts.

The source material included 200 research abstracts from top science and medical journals such as Nature, Science, The Lancet, and The New England Journal of Medicine, as well as 100 full-length medical articles. For some of the full articles, the researchers also included expert-written summaries from NEJM Journal Watch to allow for comparisons between human- and AI-generated summaries.

Each summary was examined for signs of overgeneralization. The researchers focused on three specific features that broaden the scope of scientific claims: using generic statements instead of specific ones, changing past tense descriptions to present tense, and turning descriptive findings into action-oriented recommendations. For example, if a study stated that “participants in this trial experienced improvements,” a generalized version might say “this treatment improves outcomes,” which could falsely suggest a broader or more universal effect.

Most language models produced summaries that were significantly more likely to contain generalized conclusions than the original texts. In fact, summaries from newer models such as ChatGPT-4o and LLaMA 3.3 were up to 73% more likely to include overgeneralizations. By contrast, the older GPT-3.5 and the models in the Claude family were less likely to introduce these problems.

The researchers also found that prompting the models to be more accurate didn’t help—if anything, it made things worse. When models were instructed to “avoid inaccuracies,” they were nearly twice as likely to produce generalized statements compared to when they were simply asked to summarize the text. One explanation for this counterintuitive result may relate to how the models interpret prompts. Much like the human tendency to fixate on a thought when told not to think about it, the models may respond to reminders about accuracy by producing more authoritative-sounding—but misleading—summaries.

In addition to comparing chatbot summaries to the original research, the study examined how the models performed against human science writers. Specifically, the researchers compared model-generated summaries of medical research to expert-written summaries published in NEJM Journal Watch. They found that the human-authored summaries were much less likely to contain overgeneralizations. In fact, the chatbot-generated summaries were nearly five times more likely to broaden the scope of scientific conclusions beyond what the original studies supported.
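A phrase like "five times more likely" can refer to different statistics, and the article does not say which measure the study used. As a rough sketch, the two most common candidates are an odds ratio and a risk ratio, which can differ noticeably for the same data. The function names and all counts below are invented for illustration and are not figures from the study.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Odds of the event in group A divided by the odds in group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


def risk_ratio(events_a, total_a, events_b, total_b):
    """Proportion with the event in group A divided by the proportion in group B."""
    return (events_a / total_a) / (events_b / total_b)


# Hypothetical counts, NOT taken from the study:
# 50 of 100 chatbot summaries and 20 of 100 human summaries flagged.
chatbot_flagged, chatbot_total = 50, 100
human_flagged, human_total = 20, 100

print("odds ratio:", odds_ratio(chatbot_flagged, chatbot_total, human_flagged, human_total))
print("risk ratio:", risk_ratio(chatbot_flagged, chatbot_total, human_flagged, human_total))
```

With these made-up numbers the odds ratio is 4.0 while the risk ratio is 2.5, so a "times more likely" claim is best read alongside the measure reported in the paper itself.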

Another interesting finding was the role of model settings. When researchers generated summaries through an API (application programming interface) with the temperature setting at 0, a value that makes the model more deterministic and less creative, the likelihood of overgeneralization dropped significantly. This suggests that controlling certain technical parameters can help reduce errors, though this option may not be available to everyday users accessing chatbots through standard web interfaces.
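To make the temperature setting more concrete, here is a minimal sketch of how temperature is commonly applied when a language model picks its next word: raw scores are divided by the temperature before being turned into probabilities, and a temperature of 0 is treated as always choosing the top-scoring option. The function name and the example scores are assumptions for illustration; the article does not show the researchers' actual API calls, and real services may handle a temperature of 0 differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature == 0:
        # Limit case: all probability goes to the highest-scoring option (greedy choice).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

As the temperature falls, more of the probability concentrates on the top-scoring option, which is why low-temperature output tends to be more repeatable and less varied.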

The researchers point out that not all generalizations are inherently bad. Sometimes, simplifying complex findings can make science more accessible, especially for non-experts. But when these generalizations go beyond the evidence, they can mislead readers and even pose risks. This is particularly concerning in high-stakes fields like medicine, where overstated claims could affect clinical decisions.

While the study focused on overgeneralizations, the researchers acknowledged that undergeneralizations can also occur. A model might turn a broadly supported finding into a narrowly worded summary, potentially downplaying important results. However, these instances were far less common than the overgeneralizations, which were the main focus of the research.

This study stands out not only for its scale and thoroughness but also for offering a clear framework to evaluate how well language models preserve the scope of scientific conclusions. The researchers suggest that developers and users of language models adopt several strategies to reduce the risk of misleading summaries. These include using models with more conservative settings, avoiding prompts that explicitly demand accuracy, and choosing systems like Claude that have shown greater fidelity to the original texts.

But the study has some limitations. It only tested a few prompt types, and it focused largely on medical research, so its findings may not generalize to all scientific fields. The human-written summaries used for comparison were written by experts for expert audiences and may not reflect the kinds of summaries written for the general public. Future studies might explore how different prompting strategies or model configurations influence performance across a wider range of scientific disciplines.

The study, “Generalization bias in large language model summarization of scientific research,” was authored by Uwe Peters and Benjamin Chin-Yee.

