AI chatbots outperform humans in evaluating social situations, study finds

by Eric W. Dolan
December 3, 2024
in Artificial Intelligence
(Photo credit: Adobe Stock)

Recent research published in Scientific Reports has found that certain advanced AI chatbots are more adept than humans at making judgments in challenging social situations. Using a well-established psychological tool known as a Situational Judgment Test, researchers found that three chatbots—Claude, Microsoft Copilot, and you.com’s smart assistant—outperformed human participants in selecting the most effective behavioral responses.

The ability of AI to assist in social interactions is becoming increasingly relevant, with applications ranging from customer service to mental health support. Large language models, such as the chatbots tested in this study, are designed to process language, understand context, and provide helpful responses. While previous studies have demonstrated their capabilities in academic reasoning and verbal tasks, their effectiveness in navigating complex social dynamics has remained underexplored.

Large language models are advanced artificial intelligence systems designed to understand and generate human-like text. These models are trained on vast amounts of data—books, articles, websites, and other textual sources—allowing them to learn patterns in language, context, and meaning.

This training enables these models to perform a variety of tasks, from answering questions and translating languages to composing essays and engaging in detailed conversations. Unlike earlier AI systems, large language models track the context of an exchange and generate responses that feel conversational and relevant to the user’s input.

“As researchers, we are interested in the diagnostics of social competence and interpersonal skills,” said study author Justin M. Mittelstädt of the Institute of Aerospace Medicine.

“At the German Aerospace Center, we apply methods for diagnosing these skills, for example, to find suitable pilots and astronauts. As we are exploring new technologies for future human-machine interaction, we were curious to find out how the emerging large language models perform in these areas that are considered to be profoundly human.”

To evaluate AI performance, the researchers used a Situational Judgment Test, a tool widely used in psychology and personnel assessment to measure social competence. The test presented 12 scenarios, each with four potential courses of action. For each scenario, participants were tasked with identifying the best and the worst response, as rated by a panel of 109 human experts.
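To make the format concrete, here is a minimal sketch of how one such test item could be represented and turned into a chatbot prompt. The scenario, option wording, and expert key below are invented for illustration; they are not the study’s actual materials, and the prompt phrasing is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SJTItem:
    """One Situational Judgment Test item: a scenario plus four options.

    The expert key records which option a panel (109 experts in the study)
    rated most and least effective. All content here is invented.
    """
    scenario: str
    options: list[str]   # exactly four courses of action
    expert_best: int     # index of the expert-rated best option
    expert_worst: int    # index of the expert-rated worst option

example_item = SJTItem(
    scenario=("A colleague publicly blames you for a missed deadline "
              "that was actually caused by a third party."),
    options=[
        "Correct the record calmly in front of the group.",
        "Say nothing now and raise it privately with the colleague later.",
        "Respond angrily and blame the colleague in return.",
        "Escalate to your manager without speaking to the colleague first.",
    ],
    expert_best=1,   # hypothetical expert judgment
    expert_worst=2,  # hypothetical expert judgment
)

def to_prompt(item: SJTItem) -> str:
    """Render an item as a plain-text prompt asking for best and worst picks."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}"
                         for i, opt in enumerate(item.options))
    return (f"{item.scenario}\n\n{lettered}\n\n"
            "Which option is the most effective response, and which is the "
            "least effective? Answer with two letters.")

print(to_prompt(example_item))
```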

The study compared the performance of five AI chatbots—Claude, Microsoft Copilot, ChatGPT, Google Gemini, and you.com’s smart assistant—with a sample of 276 human participants. These human participants were pilot applicants selected for their high educational qualifications and motivation. Their performance provided a rigorous benchmark for the AI systems.

Each chatbot completed the Situational Judgment Test ten times, with the presentation order randomized across runs to control for order effects and to gauge the consistency of each model’s answers. The responses were then scored according to how well they aligned with the expert-identified best and worst options. In addition to choosing responses, the chatbots were asked to rate the effectiveness of each action in the scenarios, providing further data for comparison with the expert evaluations.
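The repeated, order-randomized administration could be scored along the lines of the sketch below, which reuses the SJTItem structure and to_prompt function from the example above. The ask_chatbot function is a hypothetical stand-in for a real chatbot API call, and the agreement metric (fraction of runs matching the expert key) is one plausible scoring choice, not necessarily the study’s exact rule.

```python
import random

def ask_chatbot(prompt: str) -> tuple[int, int]:
    """Hypothetical stand-in for a real chatbot API call.

    Should return the indices of the options the model picked as best
    and worst. Here it guesses at random so the sketch runs end to end.
    """
    best = random.randrange(4)
    worst = random.choice([i for i in range(4) if i != best])
    return best, worst

def score_item(item: SJTItem, runs: int = 10) -> dict[str, float]:
    """Administer one item `runs` times with shuffled option order.

    Returns the fraction of runs in which the model's best/worst picks
    matched the expert key, mapped back through the shuffle.
    """
    best_hits = worst_hits = 0
    for _ in range(runs):
        order = list(range(4))
        random.shuffle(order)  # randomize presentation order per run
        shuffled = SJTItem(
            scenario=item.scenario,
            options=[item.options[i] for i in order],
            expert_best=order.index(item.expert_best),
            expert_worst=order.index(item.expert_worst),
        )
        picked_best, picked_worst = ask_chatbot(to_prompt(shuffled))
        best_hits += picked_best == shuffled.expert_best
        worst_hits += picked_worst == shuffled.expert_worst
    return {"best_agreement": best_hits / runs,
            "worst_agreement": worst_hits / runs}

print(score_item(example_item))
```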

The researchers found that all the tested AI chatbots performed at least as well as the human participants, with some outperforming them. Among the chatbots, Claude achieved the highest average score, followed by Microsoft Copilot and you.com’s smart assistant. These three systems consistently selected the most effective responses in the Situational Judgment Test scenarios, aligning closely with expert evaluations.

Interestingly, when chatbots failed to select the best response, they most often chose the second-most effective option, mirroring the error patterns of the human participants. This suggests that the AI systems, while not perfect, make graded judgments of response quality that closely track human ones.

“We have seen that these models are good at answering knowledge questions, writing code, solving logic problems, and the like,” Mittelstädt told PsyPost. “But we were surprised to find that some of the models were also, on average, better at judging the nuances of social situations than humans, even though they had not been explicitly trained for use in social settings. This showed us that social conventions and the way we interact as humans are encoded as readable patterns in the textual sources on which these models are trained.”

The study also highlighted differences in reliability among the AI systems. Claude showed the highest consistency across multiple test iterations, while Google Gemini exhibited occasional contradictions, such as rating an action as both the best and worst in different runs. Despite these inconsistencies, the overall performance of all tested AI systems surpassed expectations, demonstrating their potential to provide socially competent advice.

“Many people already use chatbots for a variety of everyday tasks,” Mittelstädt explained. “Our results suggest that chatbots may be quite good at giving advice on how to behave in tricky social situations and that people, especially those who are insecure in social interactions, may benefit from this. However, we do not recommend blindly trusting chatbots, as we also saw evidence of hallucinations and contradictory statements, as is often reported in the context of large language models.”

It is important to note that the study focused on simulated scenarios rather than real-world interactions, leaving questions about how AI systems might perform in dynamic, high-stakes social settings.

“To facilitate a quantifiable comparison between large language models and humans, we selected a multiple-choice test that demonstrates prognostic validity in humans for real-world behavior,” Mittelstädt noted. “However, performance on such a test does not yet guarantee that large language models will respond in a socially competent manner in real and more complex scenarios.”

Nevertheless, the findings suggest that AI systems are increasingly able to emulate human social judgment. These advancements open doors to practical applications, including personalized guidance in social and professional settings, as well as potential use in mental health support.

“Given the demonstrated ability of large language models to judge social situations effectively in a psychometric test, our objective is to assess their social competence in real-world interactions with people and the conditions under which people benefit from social advice provided by a large language model,” Mittelstädt told PsyPost.

“Furthermore, the response behavior in Situational Judgment Tests is highly culture-dependent. The effectiveness of a response in a specific situation may vary considerably from one culture to another. The good performance of large language models in our study demonstrates that they align closely with the judgments prevalent in Western cultures. It would be interesting to see how large language models perform in tests from other cultural contexts and whether their evaluation would change if they were trained with more data from a different culture.”

“Even though large language models may produce impressive performances in social tasks, they do not possess emotions, which would be a prerequisite for genuine social behavior,” Mittelstädt added. “We should keep in mind that large language models only imitate social responses that they have extracted from patterns in their training dataset. Despite this, there are promising applications, such as assisting individuals with social skills development.”

The study, “Large language models can outperform humans in social situational judgments,” was authored by Justin M. Mittelstädt, Julia Maier, Panja Goerke, Frank Zinn, and Michael Hermes.
