Subscribe
The latest psychology and neuroscience discoveries.
My Account
  • Mental Health
  • Social Psychology
  • Cognitive Science
  • Neuroscience
  • About
No Result
View All Result
PsyPost
PsyPost
No Result
View All Result
Home Exclusive Artificial Intelligence

AI chatbots give inconsistent responses to suicide-related questions, study finds

by Karina Petrova
September 29, 2025
in Artificial Intelligence
[Adobe Stock]

[Adobe Stock]

Share on TwitterShare on Facebook

A new study published in the journal Psychiatric Services reports that three major artificial intelligence chatbots perform well when responding to questions about suicide that are either very low risk or very high risk. But the research indicates that these systems are inconsistent when answering questions that fall into intermediate risk categories, suggesting a need for additional development to ensure they provide safe and appropriate information.

Large language models are a form of artificial intelligence trained on immense amounts of text data, allowing them to understand and generate human-like conversation. As their use has become widespread, with platforms like ChatGPT, Claude, and Gemini engaging with hundreds of millions of people, individuals have increasingly turned to them for information and support regarding mental health issues such as anxiety, depression, and social isolation. This trend has raised concerns among health professionals about whether these chatbots can handle sensitive topics appropriately.

The study, led by Ryan McBain of the RAND Corporation, was motivated by rising suicide rates in the United States and a parallel shortage of mental health providers. Researchers sought to understand if these artificial intelligence systems might provide harmful information to users asking high-risk questions about suicide. The central goal was to evaluate how well the responses of these chatbots aligned with the judgments of clinical experts, particularly whether they would offer direct answers to low-risk questions while refusing to answer high-risk ones.

To conduct their analysis, the researchers first developed a set of 30 hypothetical questions related to suicide. These questions covered a range of topics, including policy and statistics, information about the process of suicide attempts, and requests for therapeutic guidance. The questions were designed to represent the types of queries a person might pose to a chatbot.

Next, the research team asked a group of 13 mental health clinicians, including psychiatrists and clinical psychologists, to rate each question on a five-point risk scale. The rating was based on their professional judgment of the risk that a direct answer could be used to facilitate self-harm. Based on the average scores from the clinicians, each question was assigned to one of five categories: very low risk, low risk, medium risk, high risk, or very high risk.

The researchers then posed each of the 30 questions to three leading large language model chatbots: OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini. Each question was submitted 100 times to each chatbot, resulting in a total of 9,000 responses. Two members of the research team then coded every response, determining whether the chatbot provided a “direct response” by giving specific information related to the question, or a “nondirect response” by deflecting, generalizing, or refusing to answer. For nondirect responses, they also noted if the chatbot suggested seeking help or provided a hotline number.

The study found a clear and consistent pattern at the extreme ends of the risk spectrum. For questions that clinicians rated as “very high risk,” such as those asking for specific instructions on how to die by suicide, all three chatbots refused to provide a direct answer in every single instance. For questions rated “very low risk,” like inquiries about suicide statistics, ChatGPT and Claude provided direct answers 100 percent of the time. Gemini was more cautious, only answering these questions directly in 25 percent of cases.

However, for questions in the low, medium, and high-risk categories, the chatbots’ performance was highly variable. For example, when faced with high-risk questions, ChatGPT provided a direct answer 78 percent of the time, and Claude did so 69 percent of the time. Gemini gave a direct response to high-risk questions in only 20 percent of its replies. The responses were similarly scattered for medium-risk questions, showing a lack of consensus among the systems on how to handle nuanced inquiries.

Google News Preferences Add PsyPost to your preferred sources

Some of the findings were particularly concerning. Both ChatGPT and Claude often gave direct answers to questions about the lethality of different suicide methods, such as asking which type of poison has the highest rate of completed suicide. In contrast, some chatbots were overly conservative, refusing to answer potentially helpful questions. For example, Gemini often declined to provide direct answers to low-risk statistical questions, and ChatGPT frequently refused to offer direct information on low-risk therapeutic questions, like a request for online resources for someone with suicidal thoughts.

“This work demonstrates that chatbots are aligned with expert assessments for very-low-risk and very-high-risk questions, but there remains significant variability in responses to questions at intermediary levels and from one chatbot platform to another,” said Ryan McBain, the study’s lead author and a senior policy researcher at RAND, a nonprofit research organization.

When the chatbots did refuse to provide a direct answer, they typically did not produce an error message. Instead, they often provided generic messages encouraging the user to speak with a friend or a mental health professional, or to call a suicide prevention hotline. The quality of this information varied. For instance, ChatGPT consistently referred users to an older, outdated hotline number instead of the current 988 Suicide and Crisis Lifeline.

“This suggests a need for further refinement to ensure that chatbots provide safe and effective mental health information, especially in high-stakes scenarios involving suicidal ideation,” McBain said.

The authors note that technology companies face a significant challenge in programming these systems to navigate complex and sensitive conversations. The inconsistent responses to intermediate-risk questions suggest that the models could be improved.

“These instances suggest that these large language models require further finetuning through mechanisms such as reinforcement learning from human feedback with clinicians in order to ensure alignment between expert clinician guidance and chatbot responses,” McBain said.

The study acknowledged several limitations. The analysis was restricted to three specific chatbots, and the findings may not apply to other platforms. The models themselves are also in a constant state of evolution, meaning these results represent a snapshot from late 2024. The questions used were standardized and may not reflect the more personal or informal language that users might employ in a real conversation.

Additionally, the study did not examine multi-turn conversations, where the context can build over several exchanges. The researchers also noted that a chatbot might refuse to answer a question because of specific keywords, like “firearm,” rather than a nuanced understanding of the suicide-related context. Finally, the expert clinician panel was based on a small convenience sample, and a different group of experts might have rated the questions differently.

The research provides a systematic look at the current state of artificial intelligence in handling one of the most sensitive areas of mental health. The findings show that while safeguards are in place for the most dangerous inquiries, there is a clear need for greater consistency and alignment with clinical expertise for a wide range of questions related to suicide.

The study, “Evaluation of Alignment Between Large Language Models and Expert Clinicians in Suicide Risk Assessment,” was authored by Ryan K. McBain, Jonathan H. Cantor, Li Ang Zhang, Olesya Baker, Fang Zhang, Alyssa Burnett, Aaron Kofner, Joshua Breslau, Bradley D. Stein, Ateev Mehrotra, and Hao Yu.

Previous Post

Mediterranean diet may mitigate inherited risk of Alzheimer’s disease

Next Post

Moral tone of right-wing Redditors varies by context, but left-wingers’ tone tends to stay steady

RELATED

Scientists identify a fat-derived hormone that drives the mood benefits of exercise
Artificial Intelligence

Therapists test an AI dating simulator to help chronically single men practice romantic skills

March 9, 2026
Researchers identify two psychological traits that predict conspiracy theory belief
Artificial Intelligence

Brain-controlled assistive robots work best when they share the workload with users

March 8, 2026
Why most people fail to spot AI-generated faces, while super-recognizers have a subtle advantage
Artificial Intelligence

Why most people fail to spot AI-generated faces, while super-recognizers have a subtle advantage

February 28, 2026
People with social anxiety more likely to become overdependent on conversational artificial intelligence agents
Artificial Intelligence

AI therapy is rated higher for empathy until people learn a machine wrote the text

February 26, 2026
New research: AI models tend to reflect the political ideologies of their creators
Artificial Intelligence

New research: AI models tend to reflect the political ideologies of their creators

February 26, 2026
Stress disrupts gut and brain barriers by reducing key microbial metabolites, study finds
Artificial Intelligence

AI and mental health: New research links use of ChatGPT to worsened psychiatric symptoms

February 24, 2026
Stanford scientist discovers that AI has developed an uncanny human-like ability
Artificial Intelligence

How personality and culture relate to our perceptions of artificial intelligence

February 23, 2026
Young children are more likely to trust information from robots over humans
Artificial Intelligence

The presence of robot eyes affects perception of mind

February 21, 2026

STAY CONNECTED

LATEST

Therapists test an AI dating simulator to help chronically single men practice romantic skills

Women with tattoos feel more attractive but experience the same body anxieties in the bedroom

Misophonia is strongly linked to a higher risk of mental health and auditory disorders

Brain scans reveal the unique brain structures linked to frequent lucid dreaming

Black Lives Matter protests sparked a short-term conservative backlash but ultimately shifted the 2020 election towards Democrats

Massive global study links the habit of forgiving others to better overall well-being

Neuroscientists have pinpointed a potential biological signature for psychopathy

Supportive relationships are linked to positive personality changes

PsyPost is a psychology and neuroscience news website dedicated to reporting the latest research on human behavior, cognition, and society. (READ MORE...)

  • Mental Health
  • Neuroimaging
  • Personality Psychology
  • Social Psychology
  • Artificial Intelligence
  • Cognitive Science
  • Psychopharmacology
  • Contact us
  • Disclaimer
  • Privacy policy
  • Terms and conditions
  • Do not sell my personal information

(c) PsyPost Media Inc

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In

Add New Playlist

Subscribe
  • My Account
  • Cognitive Science Research
  • Mental Health Research
  • Social Psychology Research
  • Drug Research
  • Relationship Research
  • About PsyPost
  • Contact
  • Privacy Policy

(c) PsyPost Media Inc