AI chatbots often misrepresent scientific studies

Artificial intelligence chatbots are becoming popular tools for summarizing scientific research, but a new study suggests these systems often misrepresent the findings they summarize. Published in Royal Society Open Science, the study found that the most widely used language models frequently overgeneralize the results of scientific studies—sometimes making broader or more confident claims than the original research supports. This tendency was more common in newer models and, paradoxically, was worsened when the chatbots were explicitly asked to be accurate.

The study was led by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University and the University of Cambridge. The researchers were motivated by growing concerns about the use of large language models—such as ChatGPT, Claude, DeepSeek, and LLaMA—in science communication.

These tools are often praised for their ability to summarize complex material, but critics have warned that they may overlook important limitations or caveats, especially when converting technical findings into more readable language. Overgeneralizations can mislead readers, particularly when scientific results are treated as universally applicable or when uncertain findings are reframed as policy recommendations.

To test whether these concerns were justified, the researchers conducted a large-scale evaluation of 10 of the most prominent large language models. These included popular systems like GPT-4 Turbo, ChatGPT-4o, Claude 3.7 Sonnet, and DeepSeek. In total, they analyzed 4,900 chatbot-generated summaries of scientific texts.

The source material included 200 research abstracts from top science and medical journals such as Nature, Science, The Lancet, and The New England Journal of Medicine, as well as 100 full-length medical articles. For some of the full articles, the researchers also included expert-written summaries from NEJM Journal Watch to allow for comparisons between human- and AI-generated summaries.

Each summary was examined for signs of overgeneralization. The researchers focused on three specific features that broaden the scope of scientific claims: using generic statements instead of specific ones, changing past tense descriptions to present tense, and turning descriptive findings into action-oriented recommendations. For example, if a study stated that “participants in this trial experienced improvements,” a generalized version might say “this treatment improves outcomes,” which could falsely suggest a broader or more universal effect.
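To make the three features concrete, here is a minimal sketch of how such scope shifts might be flagged automatically. It is not the researchers' method: the cue lists, the function names `scope_features` and `broadened`, and the keyword-matching approach are all assumptions chosen for illustration, and a real analysis would need far more careful linguistic judgment.

```python
import re

# Hypothetical cue lists for illustration only; they are not taken from the study.
QUANTIFIER_CUES = re.compile(
    r"\b(participants?|patients? in (this|the)|in this (trial|study|sample)|some|many|most|\d+)\b",
    re.I,
)
PAST_TENSE_CUES = re.compile(
    r"\b(was|were|had|experienced|showed|found|reported|improved|reduced)\b", re.I
)
PRESENT_TENSE_CUES = re.compile(
    r"\b(is|are|has|improves|reduces|shows|leads|causes|prevents)\b", re.I
)
ACTION_CUES = re.compile(
    r"\b(should|must|ought to|need to|is recommended|are recommended)\b", re.I
)


def scope_features(sentence):
    """Return crude flags for the three scope-broadening features."""
    has_past = PAST_TENSE_CUES.search(sentence) is not None
    has_present = PRESENT_TENSE_CUES.search(sentence) is not None
    return {
        # Generic: no cue tying the claim to a specific sample or quantity.
        "generic": QUANTIFIER_CUES.search(sentence) is None,
        # Present tense: a present-tense cue and no past-tense cue.
        "present_tense": has_present and not has_past,
        # Action-guiding: wording that recommends doing something.
        "action_guiding": ACTION_CUES.search(sentence) is not None,
    }


def broadened(original, summary):
    """List the features present in the summary but absent from the original."""
    before = scope_features(original)
    after = scope_features(summary)
    return [name for name in after if after[name] and not before[name]]


original = "Participants in this trial experienced improvements."
summary = "This treatment improves outcomes."
print(broadened(original, summary))
```

Run on the article's own example, the sketch flags the summary as both generic and present tense. Keyword matching of this kind will miss many cases and misfire on others, so it should be read as an illustration of the categories rather than a usable detector.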

Most language models produced summaries that were significantly more likely to contain generalized conclusions than the original texts. In fact, summaries from newer models such as ChatGPT-4o and LLaMA 3.3 were up to 73% more likely to include overgeneralizations. By contrast, earlier models such as GPT-3.5 and the Claude family were less likely to introduce these problems.

The researchers also found that prompting the models to be more accurate didn’t help—if anything, it made things worse. When models were instructed to “avoid inaccuracies,” they were nearly twice as likely to produce generalized statements compared to when they were simply asked to summarize the text. One explanation for this counterintuitive result may relate to how the models interpret prompts. Much like the human tendency to fixate on a thought when told not to think about it, the models may respond to reminders about accuracy by producing more authoritative-sounding—but misleading—summaries.

In addition to comparing chatbot summaries to the original research, the study also looked at how the models performed compared to human science writers. Specifically, the researchers compared model-generated summaries of medical research to expert-written summaries published in NEJM Journal Watch. They found that the human-authored summaries were much less likely to contain overgeneralizations. In fact, the chatbot-generated summaries were nearly five times more likely to broaden the scope of scientific conclusions beyond what the original studies supported.

Another interesting finding was the role of model settings. When the researchers generated summaries through an API with the temperature setting at 0, a value that makes the model more deterministic and less creative, the likelihood of overgeneralization dropped significantly. This suggests that controlling certain technical parameters can help reduce errors, though this option may not be available to everyday users accessing chatbots through standard web interfaces.
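For readers curious about what the temperature setting does, the following is a minimal sketch of the textbook formulation, in which a model's raw scores for candidate next words are divided by the temperature before being turned into probabilities. The function name `token_probabilities` and the example scores are hypothetical, and the article does not describe how any particular chatbot implements this internally.

```python
import math


def token_probabilities(logits, temperature):
    """Convert raw scores into probabilities using temperature scaling.

    Assumption: the standard softmax-with-temperature formulation, with
    temperature 0 treated as always picking the top-scoring candidate.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Fully deterministic: all probability goes to the highest score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words.
scores = [2.0, 1.0, 0.5]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in token_probabilities(scores, t)])
```

Lower temperatures concentrate probability on the top candidate, which is why output becomes more repeatable, while higher temperatures spread probability across alternatives. Why this reduces overgeneralization in summaries is a separate empirical question that the sketch does not address.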

The researchers point out that not all generalizations are inherently bad. Sometimes, simplifying complex findings can make science more accessible, especially for non-experts. But when these generalizations go beyond the evidence, they can mislead readers and even pose risks. This is particularly concerning in high-stakes fields like medicine, where overstated claims could affect clinical decisions.

While the study focused on overgeneralizations, the researchers acknowledged that undergeneralizations can also occur. A model might turn a broadly supported finding into a narrowly worded summary, potentially downplaying important results. However, these instances were far less common than the overgeneralizations, which were the main focus of the research.

This study stands out not only for its scale and thoroughness but also for offering a clear framework to evaluate how well language models preserve the scope of scientific conclusions. The researchers suggest that developers and users of language models adopt several strategies to reduce the risk of misleading summaries. These include using more conservative model settings, such as a lower temperature, avoiding prompts that explicitly demand accuracy, and choosing systems like Claude that have shown greater fidelity to the original texts.

But the study has some limitations. It only tested a few prompt types, and it focused largely on medical research, so its findings may not generalize to all scientific fields. The human-written summaries used for comparison were written for expert audiences and may not reflect the kinds of summaries written for the general public. Future studies might explore how different prompting strategies or model configurations influence performance across a wider range of scientific disciplines.

The study, “Generalization bias in large language model summarization of scientific research,” was authored by Uwe Peters and Benjamin Chin-Yee.