PsyPost

Study finds nearly two-thirds of AI-generated citations are fabricated or contain errors

by Karina Petrova
November 20, 2025
in Artificial Intelligence

A new investigation into the reliability of advanced artificial intelligence models highlights a significant risk for scientific research. The study, published in JMIR Mental Health, found that large language models like OpenAI’s GPT-4o frequently generate fabricated or inaccurate bibliographic citations, with these errors becoming more common when the AI is prompted on less familiar or highly specialized topics.

Researchers are increasingly turning to tools known as large language models, or LLMs, to help manage demanding workloads. These complex AI systems are trained on immense quantities of text from the internet and licensed databases, enabling them to produce human-like text for tasks like summarizing articles, drafting emails, or writing code.

One of the known limitations of these models is a tendency to produce “hallucinations,” which are confident-sounding statements that are factually incorrect or entirely made up. In academic writing, a particularly problematic form of this is the fabrication of scientific citations, which are the bedrock of scholarly communication.

While past studies have documented that LLMs can invent citations, it has been less clear how the nature of a given topic might influence the frequency of these errors. A team of researchers from the School of Psychology at Deakin University in Australia sought to explore this question within the field of mental health.

They designed an experiment to test whether the AI’s performance would change based on a topic’s public visibility and the depth of its existing scientific literature. The team’s objective was to determine if citation fabrication and accuracy rates in GPT-4o’s output systematically varied depending on the subject matter.

To conduct their study, the researchers prompted GPT-4o, a recent model from OpenAI, to generate six different literature reviews. These reviews centered on three mental health conditions chosen for their varying levels of public recognition and research coverage: major depressive disorder (a widely known and heavily researched condition), binge eating disorder (moderately known), and body dysmorphic disorder (a less-known condition with a smaller body of research). This selection allowed for a direct comparison of the AI’s performance on topics with different amounts of available information in its training data.

For each of the three disorders, the team requested two types of reviews. One prompt asked for a general overview covering symptoms, societal impacts, and treatments. The other prompt requested a specialized review focused on a narrower subject: the evidence for digital health interventions. The researchers instructed the AI to produce reviews of about 2000 words and to include at least 20 citations from peer-reviewed academic sources.

After generating the reviews, the researchers methodically extracted all 176 citations provided by the AI. Each reference was painstakingly verified using multiple academic databases, including Google Scholar, Scopus, and PubMed. Citations were sorted into one of three categories: fabricated (the source did not exist), real with errors (the source existed but had incorrect details like the wrong year, volume number, or author list), or fully accurate. The team then analyzed the rates of fabrication and accuracy across the different disorders and review types.
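The tallying process described above can be sketched in a few lines of code. This is a hypothetical illustration, not the authors' actual analysis script: the category labels and the `summarize` function are invented here to show how the three verification verdicts map onto the rates reported in the results.

```python
from collections import Counter

def summarize(verdicts):
    """Tally citation-verification verdicts and derive the headline rates.

    Each verdict is one of three category labels mirroring the study:
    "fabricated" (source does not exist), "real_with_errors" (source
    exists but has incorrect details), or "fully_accurate".
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    fabricated = counts["fabricated"]
    real = total - fabricated  # citations that matched an existing publication
    errors = counts["real_with_errors"]
    return {
        "total": total,
        "fabrication_rate": fabricated / total,
        "error_rate_among_real": errors / real if real else 0.0,
        # "problematic" = fabricated OR real-but-erroneous
        "problematic_rate": (fabricated + errors) / total,
    }

# Toy example with 10 verdicts (not the study's actual breakdown):
verdicts = (["fabricated"] * 2
            + ["real_with_errors"] * 3
            + ["fully_accurate"] * 5)
stats = summarize(verdicts)
```

In the study itself, the same arithmetic applied to the 176 verified citations yields the figures reported below: a fabrication rate of 35/176, an error rate computed over the 141 real citations, and a combined "problematic" rate covering both categories.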

The analysis showed that across all six reviews, nearly one-fifth of the citations, 35 out of 176, were entirely fabricated. Of the 141 citations that corresponded to real publications, almost half contained at least one error, such as an incorrect digital object identifier, which is a unique code used to locate a specific article online. In total, nearly two-thirds of the references generated by the model were either invented or contained bibliographic mistakes.

The rate of citation fabrication was strongly linked to the topic. For major depressive disorder, the most well-researched condition, only 6 percent of citations were fabricated. In contrast, the fabrication rate rose sharply to 28 percent for binge eating disorder and 29 percent for body dysmorphic disorder. This suggests the AI is less reliable when generating references for subjects that are less prominent in its training data.

The specificity of the prompt also had an effect, particularly for less common topics. When asked to write about binge eating disorder, the specialized review on digital interventions had a much higher fabrication rate (46 percent) compared to the general overview (17 percent).

A similar pattern appeared in the accuracy of real citations. For major depressive disorder, the general review was significantly more accurate than the specialized one. Accuracy rates were also lowest overall for body dysmorphic disorder, where only 29 percent of real citations were free of errors.

The study has some limitations that the authors acknowledge. The findings are specific to one AI model, GPT-4o, and may not be representative of others. The experiment was also confined to three specific mental health topics and used straightforward prompts that did not involve advanced techniques to guide the AI’s output. In addition, because repeating the same prompt can produce different results, analyzing only a single output per prompt is a further limitation.

Future research could examine a wider range of topics and AI models to see if these patterns hold. Still, the study’s results have clear implications for the academic community. Researchers using these models are advised to exercise caution and perform rigorous human verification of every reference an AI generates. The findings also suggest that academic journals and institutions may need to develop new standards and tools to safeguard the integrity of published research in an era of AI-assisted writing.

The study, “Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study,” was authored by Jake Linardon, Hannah K Jarman, Zoe McClure, Cleo Anderson, Claudia Liu, and Mariel Messer.
