Subscribe
The latest psychology and neuroscience discoveries.
My Account
  • Mental Health
  • Social Psychology
  • Cognitive Science
  • Psychopharmacology
  • Neuroscience
  • About
No Result
View All Result
PsyPost
PsyPost
No Result
View All Result
Home Exclusive Artificial Intelligence

AI chatbots often misrepresent scientific studies — and newer models may be worse

by Eric W. Dolan
May 20, 2025
in Artificial Intelligence
[Adobe Stock]

[Adobe Stock]

Share on TwitterShare on Facebook
Follow PsyPost on Google News

Artificial intelligence chatbots are becoming popular tools for summarizing scientific research, but a new study suggests these systems often misrepresent the findings they summarize. Published in Royal Society Open Science, the study found that the most widely used language models frequently overgeneralize the results of scientific studies—sometimes making broader or more confident claims than the original research supports. This tendency was more common in newer models and, paradoxically, was worsened when the chatbots were explicitly asked to be accurate.

The study was led by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University and the University of Cambridge. The researchers were motivated by growing concerns about the use of large language models—such as ChatGPT, Claude, DeepSeek, and LLaMA—in science communication.

These tools are often praised for their ability to summarize complex material, but critics have warned that they may overlook important limitations or caveats, especially when converting technical findings into more readable language. Overgeneralizations can mislead readers, particularly when scientific results are treated as universally applicable or when uncertain findings are reframed as policy recommendations.

To test whether these concerns were justified, the researchers conducted a large-scale evaluation of 10 of the most prominent large language models. These included popular systems like GPT-4 Turbo, ChatGPT-4o, Claude 3.7 Sonnet, and DeepSeek. In total, they analyzed 4,900 chatbot-generated summaries of scientific texts.

The source material included 200 research abstracts from top science and medical journals such as Nature, Science, The Lancet, and The New England Journal of Medicine, as well as 100 full-length medical articles. For some of the full articles, the researchers also included expert-written summaries from NEJM Journal Watch to allow for comparisons between human- and AI-generated summaries.

Each summary was examined for signs of overgeneralization. The researchers focused on three specific features that broaden the scope of scientific claims: using generic statements instead of specific ones, changing past tense descriptions to present tense, and turning descriptive findings into action-oriented recommendations. For example, if a study stated that “participants in this trial experienced improvements,” a generalized version might say “this treatment improves outcomes,” which could falsely suggest a broader or more universal effect.

Most language models produced summaries that were significantly more likely to contain generalized conclusions than the original texts. In fact, summaries from newer models such as ChatGPT-4o and LLaMA 3.3 were up to 73% more likely to include overgeneralizations. By contrast, earlier models such as GPT-3.5 and the Claude family were less likely to introduce these problems.

The researchers also found that prompting the models to be more accurate didn’t help—if anything, it made things worse. When models were instructed to “avoid inaccuracies,” they were nearly twice as likely to produce generalized statements compared to when they were simply asked to summarize the text. One explanation for this counterintuitive result may relate to how the models interpret prompts. Much like the human tendency to fixate on a thought when told not to think about it, the models may respond to reminders about accuracy by producing more authoritative-sounding—but misleading—summaries.

In addition to comparing chatbot summaries to the original research, the study also looked at how the models performed compared to human science writers. Specifically, the researchers compared model-generated summaries of medical research to expert-written summaries published in NEJM Journal Watch. They found that the human-authored summaries were much less likely to contain overgeneralizations. In fact, the chatbot-generated summaries were nearly five times more likely to broaden the scope of scientific conclusions beyond what the original studies supported.

Another interesting finding was the role of model settings. When researchers used an “API” to generate summaries with the temperature setting at 0—a value that makes the model more deterministic and less creative—the likelihood of overgeneralization dropped significantly. This suggests that controlling certain technical parameters can help reduce errors, though this option may not be available to everyday users accessing chatbots through standard web interfaces.

The researchers point out that not all generalizations are inherently bad. Sometimes, simplifying complex findings can make science more accessible, especially for non-experts. But when these generalizations go beyond the evidence, they can mislead readers and even pose risks. This is particularly concerning in high-stakes fields like medicine, where overstated claims could affect clinical decisions.

While the study focused on overgeneralizations, the researchers acknowledged that undergeneralizations can also occur. A model might turn a broadly supported finding into a narrowly worded summary, potentially downplaying important results. However, these instances were far less common than the overgeneralizations, which were the main focus of the research.

This study stands out not only for its scale and thoroughness but also for offering a clear framework to evaluate how well language models preserve the scope of scientific conclusions. The researchers suggest that developers and users of language models adopt several strategies to reduce the risk of misleading summaries. These include using models with more conservative settings, avoiding prompts that explicitly demand accuracy, and choosing systems like Claude that have shown greater fidelity to the original texts.

But the study has some limitations. It only tested a few prompt types, and it focused largely on medical research, which may not generalize to all scientific fields. The human-written summaries used for comparison were produced by expert audiences and may not reflect the kinds of summaries written for the general public. Future studies might explore how different prompting strategies or model configurations influence performance across a wider range of scientific disciplines.

The study, “Generalization bias in large language model summarization of scientific research,” was authored by Uwe Peters and Benjamin Chin-Yee.

TweetSendScanShareSendPinShareShareShareShareShare

RELATED

Generative AI simplifies science communication, boosts public trust in scientists
Artificial Intelligence

Artificial confidence? People feel more creative after viewing AI-labeled content

May 16, 2025

A new study suggests that when people see creative work labeled as AI-generated rather than human-made, they feel more confident in their own abilities. The effect appears across jokes, drawings, poems, and more—and might stem from subtle social comparison processes.

Read moreDetails
AI-driven brain training reduces impulsiveness in kids with ADHD, study finds
ADHD

AI-driven brain training reduces impulsiveness in kids with ADHD, study finds

May 9, 2025

Researchers found that a personalized, game-based cognitive therapy powered by artificial intelligence significantly reduced impulsiveness and inattentiveness in children with ADHD. Brain scans showed signs of neurological improvement, highlighting the potential of AI tools in mental health treatment.

Read moreDetails
Neuroscientists use brain implants and AI to map language processing in real time
Artificial Intelligence

Neuroscientists use brain implants and AI to map language processing in real time

May 9, 2025

Researchers recorded brain activity during unscripted conversations and compared it to patterns in AI language models. The findings reveal a network of brain areas that track speech meaning and speaker transitions, offering a detailed picture of how we communicate.

Read moreDetails
Artificial intelligence: 7 eye-opening new scientific discoveries
Artificial Intelligence

Artificial intelligence: 7 eye-opening new scientific discoveries

May 8, 2025

As artificial intelligence becomes more advanced, researchers are uncovering both how these systems behave and how they influence human life. These seven recent studies offer insights into the psychology of AI—and what happens when humans and machines interact.

Read moreDetails
New study: AI can identify autism from tiny hand motion patterns
Artificial Intelligence

New study: AI can identify autism from tiny hand motion patterns

May 8, 2025

Hand movements during a basic grasping task can help identify autism, new research suggests. The study used motion tracking and machine learning to analyze finger movements and found that classification accuracy exceeded 84% using just two sensors.

Read moreDetails
Cognitive psychologist explains why AI images fool so many people
Artificial Intelligence

Cognitive psychologist explains why AI images fool so many people

May 7, 2025

Despite glaring errors, many AI-generated images go undetected by casual viewers. A cognitive psychologist explores how attention, perception, and mental shortcuts shape what we notice—and what we miss—while scrolling through our increasingly synthetic digital feeds.

Read moreDetails
Are AI lovers replacing real romantic partners? Surprising findings from new research
Artificial Intelligence

Are AI lovers replacing real romantic partners? Surprising findings from new research

May 4, 2025

Falling in love with a virtual character might change how people feel about real-life marriage. A recent study found that these digital romances can both dampen and strengthen marriage intentions, depending on the emotional and psychological effects involved.

Read moreDetails
The surprising link between conspiracy mentality and deepfake detection ability
Artificial Intelligence

Homemade political deepfakes can fool voters, but may not beat plain text misinformation

April 30, 2025

A new study finds that deepfakes made by an undergraduate student were able to sway political opinions and create false memories, but they weren't consistently more persuasive than written misinformation. The findings raise questions about the actual threat posed by amateur deepfakes in shaping public opinion.

Read moreDetails

SUBSCRIBE

Go Ad-Free! Click here to subscribe to PsyPost and support independent science journalism!

STAY CONNECTED

LATEST

What brain scans reveal about the neural correlates of pornography consumption

AI chatbots often misrepresent scientific studies — and newer models may be worse

Is gender-affirming care helping or harming mental health?

Study finds “zombie” neurons in the peripheral nervous system contribute to chronic pain

Therapeutic video game shows promise for post-COVID cognitive recovery

Passive scrolling linked to increased anxiety in teens, study finds

Your bodily awareness guides your morality, new neuroscience study suggests

Where you flirt matters: New research shows setting shapes romantic success

         
       
  • Contact us
  • Privacy policy
  • Terms and Conditions
[Do not sell my information]

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In

Add New Playlist

Subscribe
  • My Account
  • Cognitive Science Research
  • Mental Health Research
  • Social Psychology Research
  • Drug Research
  • Relationship Research
  • About PsyPost
  • Contact
  • Privacy Policy