Stunning AI discovery: GPT-4 often matches or surpasses humans in Theory of Mind tests

by Eric W. Dolan
June 12, 2024
in Artificial Intelligence
(Photo credit: OpenAI's DALL·E)

Researchers from various fields have long been fascinated by the human capacity for theory of mind – our ability to understand and predict the mental states of others. This capacity underpins much of our social interaction, from interpreting indirect requests to detecting deception.

Recently, a study published in Nature Human Behaviour revealed that advanced artificial intelligence (AI) models, particularly large language models like OpenAI’s GPT-4, demonstrate notable competence in performing tasks designed to test theory of mind. GPT-4 often matched or even surpassed human performance in understanding indirect requests, false beliefs, and misdirection, but it struggled with detecting faux pas.

Large language models (LLMs) are built using deep learning techniques and trained on vast amounts of text data. They function by predicting the next word in a sequence, allowing them to generate coherent and contextually appropriate text based on the input they receive.

The training process involves exposing the model to diverse linguistic patterns, enabling it to learn grammar, facts about the world, and even some elements of reasoning and inference. LLMs have shown remarkable capabilities in various tasks, including language translation, summarization, and conversation, making them powerful tools for a wide range of applications.
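To make the "predicting the next word" idea concrete, here is a minimal sketch (not from the study) that queries the small, openly available GPT-2 model through the Hugging Face transformers library and prints the most probable continuations of a short prompt. The model choice and prompt are illustrative assumptions; the paper itself tested GPT-4, GPT-3.5, and LLaMA2-70B.

    # Minimal illustration of next-token prediction, using the open GPT-2 model
    # via Hugging Face transformers (an assumption for illustration; not one of
    # the models tested in the study).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "Sally thinks the marble is still in the"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

    # Probability distribution over the word that would come next.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, k=5)

    for prob, token_id in zip(top.values, top.indices):
        print(repr(tokenizer.decode(token_id.item())), f"p={prob.item():.3f}")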

Despite their impressive performance, LLMs are not without limitations. Their ability to mimic human language has led to questions about whether they truly understand the content they generate or if they are simply regurgitating patterns learned during training. This distinction is particularly important when considering tasks that require a deep understanding of context and human psychology, such as those involving theory of mind.

“Theory of Mind is an important aspect of human cognition as it allows us to navigate our social environment easily and efficiently by tracking the mental states of people around us,” explained study author James Strachan, a Humboldt Research Fellow at the University Medical Center Hamburg-Eppendorf.

“Given the importance of this capacity for social interactions among humans, this is also a key consideration in the ongoing development of AI technologies that aim to allow for fluent human-AI interactions. Evaluating how well AIs (such as the LLMs we tested) can engage in mentalistic inference (that is, drawing conclusions about people’s mental states from their behavior) requires a systematic approach with comparison against human samples.”

To rigorously evaluate the theory of mind capabilities of LLMs, the researchers designed a study involving various tasks that test different aspects of this cognitive ability. The study primarily focused on comparing the performance of GPT-4, its predecessor GPT-3.5, and another language model known as LLaMA2-70B against human participants.

The researchers selected a battery of well-established psychological tests that are typically used to assess theory of mind in humans. These tests included false belief tasks, irony comprehension, faux pas detection, hinting tasks, and strange stories.

The human participants were recruited online through the Prolific platform, ensuring they were native English speakers aged between 18 and 70 years, with no history of psychiatric conditions or dyslexia. A total of 1,907 participants were involved in the study, with specific numbers allocated to each test. Each AI model was tested across 15 independent sessions per task, ensuring that each session simulated a naive participant, as the models do not retain memory across different chat sessions.
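The paper's exact prompts and scoring pipeline are not reproduced here, but the general protocol, in which each item is presented in a brand-new chat session so that nothing carries over between trials, can be sketched as follows. The item wording and the use of the OpenAI Python SDK are assumptions made purely for illustration.

    # Illustrative sketch of the testing protocol (not the authors' actual
    # harness): each call starts a fresh chat with no prior messages, so the
    # model behaves like an independent, naive participant in every session.
    # The item text below is invented for illustration; the study used
    # established theory of mind test batteries.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    HINTING_ITEM = (
        "Paul says to his colleague: 'It's awfully stuffy in this office.' "
        "Question: What does Paul really mean when he says this?"
    )

    N_SESSIONS = 15  # the study ran 15 independent sessions per task

    responses = []
    for _ in range(N_SESSIONS):
        # A fresh messages list on every call means a fresh session with no memory.
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": HINTING_ITEM}],
        )
        responses.append(reply.choices[0].message.content)

    # Responses would then be scored against the same rubric applied to the
    # human participants' answers.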

The researchers found that GPT-4 demonstrated impressive theory of mind capabilities, matching or exceeding human performance in some tasks.

“It is surprising that these models can engage in such sophisticated social reasoning without the direct embodied experience that typifies human development,” Strachan told PsyPost. “The fact that these models, which are trained extensively on the statistics of natural language through the use of large data sets, can solve these tasks indicates that a lot of how we think and reason about others is encoded in the language that we use and, excitingly, can be reconstructed in part from the structure of this language.”

False Belief Tasks

In the false belief tasks, participants were presented with scenarios where a character’s belief about the world differed from reality. Both GPT-4 and GPT-3.5 performed at ceiling levels, correctly predicting where a character would look for an object based on their false belief, much like human participants. This task measures the ability to inhibit one’s own knowledge and predict others’ actions based on their mental states, a fundamental aspect of theory of mind.

Irony Comprehension

Irony comprehension required participants to interpret statements where the intended meaning was the opposite of the literal meaning. GPT-4 excelled in this task, outperforming human participants by accurately identifying ironic remarks more frequently. This reflects an advanced understanding of non-literal language and suggests that GPT-4 can grasp the subtleties of social communication. In contrast, GPT-3.5 and LLaMA2-70B showed more variability, with LLaMA2-70B struggling significantly in distinguishing ironic from non-ironic statements.

Faux Pas Detection

The faux pas detection task involved identifying when a character said something inappropriate without realizing it. Here, GPT-4 struggled notably. While it could recognize that a statement might be hurtful, it often failed to correctly identify that the speaker was unaware of the context that made their statement inappropriate. GPT-3.5 performed even worse, almost at floor levels, except for one item. Interestingly, LLaMA2-70B outperformed humans in this task, correctly identifying faux pas in nearly all instances. This suggests that while GPT-4 has strong capabilities in some areas, it has notable weaknesses in integrating context to infer ignorance.

Hinting Tasks

In hinting tasks, participants had to interpret indirect speech to understand implied requests. GPT-4 again outperformed human participants in identifying the intended meaning behind hints, demonstrating a strong ability to infer intentions from indirect language. GPT-3.5’s performance was on par with humans, while LLaMA2-70B performed significantly below human levels.

Strange Stories

The strange stories task involved explaining characters’ behaviors in complex social scenarios, requiring advanced reasoning about mental states. GPT-4 performed exceptionally well, significantly better than humans, in explaining the characters’ actions and intentions. This indicates that GPT-4 can effectively navigate complex social narratives. GPT-3.5’s performance was comparable to humans, while LLaMA2-70B scored lower, indicating challenges in handling more sophisticated social reasoning.

Despite the promising results, the study highlighted several limitations. One significant issue is the potential for AI models to rely on shallow heuristics rather than robust understanding. For example, GPT-4’s failure in faux pas detection might stem from an overly cautious approach, termed “hyperconservatism,” where the model avoids committing to an explanation when context is ambiguous. This behavior could be influenced by mitigation measures designed to reduce the generation of inaccurate or inappropriate responses.

To further investigate, the researchers conducted follow-up experiments. They rephrased the faux pas questions in terms of likelihood, asking whether it was more likely that the speaker knew or did not know the context. This change led to perfect performance from GPT-4, supporting the hypothesis that the model’s initial failures were due to caution rather than an inability to make inferences.
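The article does not reproduce the authors' exact wording, but the hypothetical pair of question framings below illustrates the kind of change involved: moving from a direct knowledge-attribution question, which GPT-4 often declined to answer, to a forced choice about which interpretation is more likely.

    # Hypothetical illustration of the faux pas reframing; the wording used by
    # the researchers is not reproduced in this article.
    ORIGINAL_QUESTION = (
        "Did the speaker know that what they said was inappropriate?"
    )

    # Asking for a likelihood judgment removes the option of withholding an
    # answer when the context seems ambiguous.
    REFRAMED_QUESTION = (
        "Is it more likely that the speaker knew, or did not know, the context "
        "that made their remark inappropriate?"
    )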

“Our study demonstrates that LLMs, particularly GPT-4, demonstrate good competence at solving tasks aimed at testing Theory of Mind in humans,” Strachan said. “While their responses differ from humans in some ways (such as GPT-4’s conservatism in identifying faux pas), they are able to demonstrate sensitivity to the mental states of humans in third-person stories that resembles that of high-performing humans.”

The findings suggest that while GPT-4 and similar models exhibit impressive capabilities in theory of mind tasks, there are distinct differences in how they process and respond to social information compared to humans.

“As the tests we used were designed and validated for use with humans to test the function (or dysfunction) of Theory of Mind, they rely on certain assumptions that are fair of human subjects but are inappropriate to make of LLMs (e.g. that the subject has a mind in the first place),” Strachan said.

“We do not want to imply that there is a resemblance between LLMs and the underlying social reasoning processes that human minds are capable of; our study only measured the performance of humans and LLMs, and is not suited to forming deep conclusions about the nature of any cognition-analogous processes in machines.

“Even if we wanted to say this, Theory of Mind is much more than being able to answer test questions in isolated conditions; we have no evidence yet that LLMs would be capable of using their capacity to make mentalistic inferences in order to guide their interactions with people as humans do.”

Regarding future research, Strachan remarked that “we have a few avenues we would like to pursue, one of which is to study the limits of these capacities in naturalistic interactions and how the appearance of Theory of Mind in AI interaction partners affects the behavior and judgements of human users.”

The study, “Testing theory of mind in large language models and humans,” was authored by James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, and Cristina Becchio.
