PsyPost

AI chatbots tend to overdiagnose mental health conditions when used without structured guidance

by Eric W. Dolan
January 22, 2026

A new study published in Psychiatry Research suggests that while large language models are capable of identifying psychiatric diagnoses from clinical descriptions, they are prone to significant overdiagnosis when operating without structured guidance. By integrating expert-derived decision trees into the diagnostic process, researchers from the University of California San Francisco found they could improve the precision of these artificial intelligence models and reduce the rate of false positives.

The rapid development of artificial intelligence has led to increased interest in its potential applications within healthcare. Large language models like OpenAI’s ChatGPT have shown an ability to process and generate complex text, which has raised the possibility of their use in mental health settings for tasks such as decision support or documentation.

Many patients are already accessing these public tools to interpret their own symptoms and seek medical advice. However, these models are trained on vast datasets from the internet rather than specific medical curricula. This training method means the models function based on statistical probability and linguistic patterns rather than genuine clinical understanding.

There is a concern that without specific medical training or guardrails, these general-purpose models might generate inaccurate or harmful advice. The ability of a computer program to produce coherent text does not necessarily equate to the ability to perform the complex reasoning required for a psychiatric diagnosis.

The authors of the new study sought to evaluate whether generic large language models could effectively reason about mental health cases. They also aimed to determine if feeding the models specific, expert-created rules could enhance their accuracy and safety.

“There has been considerable interest in using Large Language Model (LLM)-based technologies to build clinical and research tools for behavioral health. Additionally, individuals are increasingly using LLM-based chatbots (such as ChatGPT, Claude, Gemini, etc.) as health information tools and for emotional support,” explained study author Karthik V. Sarma, who founded the UCSF AI in Mental Health Research Group within the Department of Psychiatry and Behavioral Sciences at UCSF.

“We were interested in seeing how well these LLMs worked in our field, and chose vignette diagnosis as an example problem for evaluation. We also wanted to know if we could improve the performance of the models by constraining them to use reasoning pathways (decision trees) designed by psychiatric experts.”

To conduct this investigation, the researchers utilized a set of 93 clinical case vignettes drawn from the DSM-5-TR Clinical Cases book. These vignettes serve as standardized examples of patients with specific psychiatric conditions, such as depression, bipolar disorder, or schizophrenia. The team divided these cases into a training set, which was used to refine their prompting strategies, and a testing set, which was used to evaluate the final performance of the models. They tested three versions of the GPT family of models: GPT-3.5, GPT-4, and GPT-4o.

The researchers designed two distinct experimental approaches to test the models. The first was a “Base” approach, where the artificial intelligence was simply given the clinical story and asked to predict the most likely diagnoses. This method mimics how a casual user might interact with a chatbot by describing symptoms and asking for an opinion. The second method was a “Decision Tree” approach. This involved adapting the logic from the DSM-5-TR Handbook of Differential Diagnosis, a professional guide that uses branching logic to rule conditions in or out.

In the Decision Tree approach, the researchers did not ask the model for a diagnosis directly. Instead, they converted the expert logic into a series of “yes” or “no” questions. The model was prompted to answer these questions based on the case vignette.

For example, the model might be asked if a patient was experiencing a specific symptom for a certain duration. The answers to these sequential questions would then lead the system down a path toward a potential diagnosis. This method forced the model to follow a step-by-step reasoning process similar to that of a trained clinician.
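The traversal logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual implementation: the toy tree and its questions are hypothetical, and the `ask` callable stands in for an LLM that would be prompted with the case vignette plus each yes/no question.

```python
# Sketch of decision-tree-guided diagnosis: internal nodes hold yes/no
# questions, leaves hold candidate outcomes. In the study, each question
# would be answered by prompting an LLM with the vignette; here a stub
# dictionary supplies the answers. All content below is illustrative only.

def traverse(node, ask):
    """Walk the tree, asking yes/no questions until a leaf is reached."""
    while isinstance(node, dict):
        node = node["yes"] if ask(node["question"]) else node["no"]
    return node

# Toy tree (hypothetical; not real DSM-5-TR diagnostic criteria):
tree = {
    "question": "Has the patient had a depressed mood for at least two weeks?",
    "yes": {
        "question": "Has the patient ever experienced a manic episode?",
        "yes": "bipolar disorder (candidate)",
        "no": "major depressive disorder (candidate)",
    },
    "no": "no mood disorder indicated",
}

# Stubbed answers standing in for the LLM's responses to each question:
answers = {
    "Has the patient had a depressed mood for at least two weeks?": True,
    "Has the patient ever experienced a manic episode?": False,
}
result = traverse(tree, lambda q: answers[q])
print(result)  # major depressive disorder (candidate)
```

Constraining the model to one question per step, rather than asking for a free-form diagnosis, is what distinguishes this approach from the Base prompting method.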

The results showed a clear distinction between the two methods. When the models were directly prompted to guess the diagnosis in the Base approach, they demonstrated high sensitivity. The most advanced model, GPT-4o, correctly identified the author-designated diagnosis in approximately 77 percent of the cases. This indicates that the models are quite good at picking up on the presence of a disorder based on the text.

However, this high sensitivity came at the cost of precision. The Base approach resulted in a low positive predictive value of roughly 40 percent. This metric reveals that the models were casting too wide a net. They frequently assigned diagnoses that were not present in the vignettes.

On average, the base models produced more than one incorrect diagnosis for every correct one. This tendency toward overdiagnosis represents a significant risk, as it could lead to patients believing they have conditions they do not actually possess.

“This suggests to everyone that diagnoses generated by generalist chatbots may not be accurate, and it is important to consult with a health professional,” Sarma told PsyPost.

The implementation of the Decision Tree approach yielded different results. By forcing the models to adhere to expert reasoning structures, the researchers increased the positive predictive value to approximately 65 percent. This improvement means that when the system suggested a diagnosis, it was much more likely to be correct. The rate of overdiagnosis dropped compared to the direct prompting method.

There was a trade-off associated with this increased precision. The sensitivity of the Decision Tree approach was slightly lower than that of the Base approach, coming in at around 71 percent. This suggests that the strict rules of the decision trees occasionally caused the model to miss a diagnosis that the more open-ended method might have caught. Despite this slight drop in sensitivity, the overall performance as measured by the F1 statistic—a metric that balances precision and recall—was generally higher for the Decision Tree approach.
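For readers unfamiliar with these metrics, a short worked example shows how sensitivity, positive predictive value, and F1 relate. The counts below are illustrative, chosen only to mirror the reported pattern for the Base approach (roughly 77 percent sensitivity, 40 percent positive predictive value, and more than one incorrect diagnosis per correct one); they are not the study's raw data.

```python
# Worked example of the reported metrics using illustrative counts,
# NOT the study's actual confusion-matrix data.

def sensitivity(tp, fn):
    """Fraction of true diagnoses the model caught: TP / (TP + FN)."""
    return tp / (tp + fn)

def positive_predictive_value(tp, fp):
    """Fraction of suggested diagnoses that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 77 correct diagnoses caught, 23 missed,
# 115 spurious diagnoses (more than one wrong call per correct one).
tp, fn, fp = 77, 23, 115
recall = sensitivity(tp, fn)                   # 0.77
precision = positive_predictive_value(tp, fp)  # ~0.40
print(round(recall, 2), round(precision, 2), round(f1_score(precision, recall), 2))
```

Because F1 penalizes the imbalance between high recall and low precision, a method that trades a few points of sensitivity for a large gain in precision, as the Decision Tree approach did, can still score higher overall.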

The study also highlighted the importance of refining the prompts used to guide the artificial intelligence. During the training phase, the researchers found that the models sometimes misunderstood medical terminology or the structure of the decision trees. For instance, the models initially struggled to differentiate between “substance use” and medical side effects, or they would misinterpret clinical terms like “ego-dystonic.” The researchers had to iteratively refine their questions to ensure the models interpreted the clinical criteria correctly.

The findings provide evidence that generalist large language models possess an emergent capability for psychiatric reasoning. Performance improved with each successive generation of the model, with GPT-4 and GPT-4o outperforming the older GPT-3.5. This trajectory suggests that as these models continue to evolve, their capacity for handling complex medical tasks may increase.

“Practically speaking, the reduction in overdiagnosis using our decision trees was significant,” Sarma explained. “However, the task we used (vignette diagnosis) is a much easier task than real-world diagnosis. I would expect performance at this stage to be much worse in the real world, and we are still working on methods to address this problem. For now, I do not believe that these generalist models are ready for use as mental health support agents, though there may be other specialist models that are more capable.”

The tendency for overdiagnosis observed in the Base approach is particularly relevant for the general public. Individuals using chatbots for self-diagnosis should be aware that these systems may be biased toward finding pathology where none exists. The study suggests that while artificial intelligence can be a powerful tool for analyzing behavioral health data, it works best when constrained by expert medical knowledge and validated guidelines.

“It was not our goal to produce an actual clinical tool that is ready to use, and that was not the outcome of our work,” Sarma noted. “Instead, we focused on investigating how well current models work, and on whether or not our idea to integrate the current models with expert guidelines was helpful. We hope our findings can be used to develop better real-world tools in the future.”

Future research will need to focus on testing these systems with real-world patient data to see if the findings hold up in clinical practice. The authors also suggest that future work could explore using these models to identify new diagnostic patterns or language-based phenotypes that go beyond current classifications. For now, the integration of expert reasoning appears to be a necessary step in making these powerful tools safer and more accurate for potential psychiatric applications.

“We are now working on developing systems that can operate on real-world data, and measuring the impact of different methods in this setting,” Sarma explained. “We’re also working on better understanding how the use of chatbots by people with diagnosed mental illnesses impacts their health.”

The study, “Integrating expert knowledge into large language models improves performance for psychiatric reasoning and diagnosis,” was authored by Karthik V. Sarma, Kaitlin E. Hanss, Andrew J. M. Halls, Andrew Krystal, Daniel F. Becker, Anne L. Glowinski, and Atul J. Butte.
