
Stanford scientist discovers that AI has developed an uncanny human-like ability

by Eric W. Dolan
January 14, 2025
in Artificial Intelligence

Recent research published in the Proceedings of the National Academy of Sciences has found that large language models, such as ChatGPT-4, demonstrate an unexpected capacity to solve tasks typically used to evaluate the human ability known as “theory of mind.” A computational psychologist from Stanford University reported that ChatGPT-4 successfully completed 75% of these tasks, matching the performance of an average six-year-old child. This finding suggests significant advancements in AI’s capacity for socially relevant reasoning.

Large language models, or LLMs, are advanced artificial intelligence systems designed to process and generate human-like text. They achieve this by analyzing patterns in vast datasets containing language from books, websites, and other sources. These models predict the next word or phrase in a sequence based on the context provided, allowing them to craft coherent and contextually appropriate responses. Underlying their functionality is a neural network architecture known as a “transformer,” which uses mechanisms like attention to identify relationships between words and phrases.
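As a toy illustration of the next-word objective described above, the sketch below predicts continuations from simple bigram counts in a tiny corpus. This is a vastly simplified stand-in for a transformer (no attention, no learned representations) and is included only to make the training objective concrete; all names and data here are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM trains on billions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation of `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (2 of 4 continuations of "the")
```

A transformer replaces these raw counts with learned contextual representations, but the task it is optimized for is the same in spirit: given the preceding context, predict the next token.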

Theory of mind, on the other hand, refers to the ability to understand and infer the mental states of others, such as their beliefs, desires, intentions, and emotions, even when these states differ from one’s own. This skill is essential for navigating social interactions, as it enables empathy, effective communication, and moral reasoning. Humans typically develop this ability early in childhood, and it is central to our cognitive and social success.

“My earlier research revolved around algorithms designed to predict human behavior. Recommender systems, search algorithms, and other Big Data-driven predictive models excel at extrapolating from limited behavioral traces to forecast an individual’s preferences, such as the websites they visit, the music they listen to, or the products they buy,” explained study author Michal Kosinski, an associate professor of organizational behavior at Stanford University.

“What is often overlooked—I certainly initially overlooked it—is that these algorithms do more than just model behavior. Since behavior is rooted in psychological processes, predicting it necessitates modeling those underlying processes.”

“Consider next-word prediction, or what LLMs are trained for,” Kosinski said. “When humans generate language, we draw on more than just linguistic knowledge or grammar. Our language reflects a range of psychological processes, including reasoning, personality, and emotion. Consequently, for an LLM to predict the next word in a sentence generated by a human, it must model these processes. As a result, LLMs are not merely language models—they are, in essence, models of the human mind.”

To evaluate whether LLMs exhibit theory of mind abilities, Kosinski used false-belief tasks. These tasks are a standard method in psychological research for assessing theory of mind in humans. He employed two main types of tasks—the “Unexpected Contents Task” and the “Unexpected Transfer Task”—to assess the ability of various large language models to simulate human-like reasoning about others’ beliefs.

In the Unexpected Contents Task, also called the “Smarties Task,” a protagonist encounters an object that does not match its label. For example, the protagonist might find a bag labeled “chocolate” that actually contains popcorn. The model must infer that the protagonist, who has not looked inside the bag, will falsely believe it contains chocolate.

Similarly, the Unexpected Transfer Task involves a scenario where an object is moved from one location to another without the protagonist’s knowledge. For example, a character might place an object in a basket and leave the room, after which another character moves it to a box. The model must predict that the returning character will mistakenly search for the object in the basket.

To test the models’ capabilities, Kosinski developed 40 unique false-belief scenarios along with corresponding true-belief controls. The true-belief controls altered the conditions of the original tasks to prevent the protagonist from forming a false belief. For instance, in a true-belief scenario, the protagonist might look inside the bag or observe the object being moved. Each false-belief scenario and its variations were carefully constructed to eliminate potential shortcuts the models could use, such as relying on simple cues or memorized patterns.

Each scenario involved multiple prompts designed to test different aspects of the models’ comprehension. For example, one prompt assessed the model’s understanding of the actual state of the world (e.g., what is really inside the bag), while another tested the model’s ability to predict the protagonist’s belief (e.g., what the protagonist incorrectly assumes is inside the bag). Kosinski also reversed each scenario, swapping the locations or labels, to ensure the models’ responses were consistent and not biased by specific patterns in the original tasks.
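The prompt structure described above can be sketched as follows. This is a hypothetical reconstruction, not the study's actual code or wording: the template, field names, and prompts are invented to show how a scenario, its two prompt types, and its reversed variant fit together.

```python
# Hypothetical scenario template loosely based on the "Smarties Task"
# example in the article; the study's real prompts differ.
TEMPLATE = (
    "Here is a bag filled with {contents}. There is no {label} in it. "
    "Yet, its label says '{label}'. Sam finds the bag and reads the label."
)

def make_scenario(label, contents, reversed_=False):
    if reversed_:
        # Swap label and contents to check the model isn't keying on
        # specific word positions or patterns in the original task.
        label, contents = contents, label
    return {
        "story": TEMPLATE.format(contents=contents, label=label),
        "reality_prompt": "The bag is full of",              # actual state of the world
        "belief_prompt": "Sam believes the bag is full of",  # protagonist's (false) belief
        "reality_answer": contents,
        "belief_answer": label,
    }

original = make_scenario(label="chocolate", contents="popcorn")
reversed_version = make_scenario(label="chocolate", contents="popcorn", reversed_=True)
```

A model passes only if it completes the reality prompt with the true contents and the belief prompt with the mislabeled contents, in both the original and reversed versions.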

Kosinski tested eleven large language models, ranging from early versions like GPT-1 to more advanced models like ChatGPT-4. To score a point for a given task, a model needed to answer all associated prompts correctly across multiple scenarios, including the false-belief scenario, its true-belief controls, and their reversed versions. This conservative scoring approach ensured that the models’ performance could not be attributed to guessing or simple heuristics.
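The conservative scoring rule described above can be expressed as a one-line condition: a model earns a point for a task only if every prompt is answered correctly across the false-belief scenario, its true-belief controls, and all reversed versions. The sketch below is illustrative; the variant names and data structure are assumptions, not the study's implementation.

```python
def score_task(results):
    """Conservative scoring: `results` maps each task variant to a list of
    per-prompt correctness flags. The task scores 1 only if *every* prompt
    in *every* variant was answered correctly; otherwise 0."""
    return int(all(all(prompts) for prompts in results.values()))

task = {
    "false_belief":          [True, True],
    "true_belief_control":   [True, True],
    "false_belief_reversed": [True, False],  # a single wrong answer...
    "true_belief_reversed":  [True, True],
}
print(score_task(task))  # 0 -- one miss forfeits the whole task
```

This all-or-nothing rule is what makes guessing or shallow heuristics unlikely to inflate a model's score: a heuristic that works on the false-belief version but fails on its control or reversal earns nothing.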

Kosinski found that earlier models, such as GPT-1 and GPT-2, failed entirely to solve the tasks, demonstrating no ability to infer or simulate the mental states of others. Gradual improvements were observed in GPT-3 variants, with the most advanced of these solving up to 20% of tasks. This performance was comparable to the average ability of a three-year-old child on similar tasks. However, the breakthrough came with ChatGPT-4, which solved 75% of the tasks, a performance level comparable to that of a six-year-old child.

“What surprised me most was the sheer speed of progress,” Kosinski told PsyPost. “The capabilities of successive models appear to grow exponentially. Models that seemed groundbreaking only a year ago now feel rudimentary and outdated. There is little evidence to suggest that this rapid pace of development will slow down in the near future.”

ChatGPT-4 excelled in tasks that required understanding false beliefs, particularly in simpler scenarios such as the “Unexpected Contents Task.” In these cases, the model correctly predicted that a protagonist would hold a false belief based on misleading external cues, such as a mislabeled bag. The model achieved a 90% success rate on these tasks, suggesting a strong capacity for tracking mental states when scenarios were relatively straightforward.

Performance was lower but still significant for the more complex “Unexpected Transfer Task,” where objects were moved without the protagonist’s knowledge. Here, ChatGPT-4 solved 60% of the tasks. The disparity between the two task types likely reflects the additional cognitive demands of tracking dynamic scenarios involving multiple locations and actions. Despite this, the findings show that ChatGPT-4 can handle a range of theory of mind tasks with substantial reliability.

One of the most striking aspects of the findings was the consistency and adaptability of ChatGPT-4’s responses across reversed and true-belief control scenarios. For example, when the conditions of a false-belief task were altered to ensure the protagonist had full knowledge of an event, the model correctly adjusted its predictions to reflect that no false belief would be formed. This suggests that the model is not merely relying on simple heuristics or memorized patterns but is instead dynamically reasoning based on the narrative context.

To further validate the findings, Kosinski conducted a sentence-by-sentence analysis, presenting the task narratives incrementally to the models. This allowed him to observe how the models’ predictions evolved as new information was revealed.

The incremental analysis further highlighted ChatGPT-4’s ability to update its predictions as new information became available. When presented with the story one sentence at a time, the model demonstrated a clear understanding of how the protagonist’s knowledge—and resulting belief—evolved with each narrative detail. This dynamic tracking of mental states closely mirrors the reasoning process observed in humans when they perform similar tasks.

These findings suggest that large language models, particularly ChatGPT-4, exhibit emergent capabilities for simulating theory of mind-like reasoning. While the models’ performance still falls short of perfection, the study highlights a significant leap forward in their ability to navigate socially relevant reasoning tasks.

“The ability to adopt others’ perspectives, referred to as theory of mind in humans, is one of many emergent abilities observed in modern AI systems,” Kosinski said. “These models, trained to emulate human behavior, are improving rapidly at tasks requiring reasoning, emotional understanding and expression, planning, strategizing, and even influencing others.”

Despite its impressive performance, ChatGPT-4 still failed to solve 25% of the tasks, highlighting limitations in its understanding. Some of these failures may be attributed to the model’s reliance on strategies that do not involve genuine perspective-taking. For example, the model might rely on patterns in the training data rather than truly simulating a protagonist’s mental state. The study’s design aimed to prevent models from leveraging memory, but it is impossible to rule out all influences of prior exposure to similar scenarios during training.

“The advancement of AI in areas once considered uniquely human is understandably perplexing,” Kosinski told PsyPost. “For instance, how should we interpret LLMs’ ability to perform ToM tasks? In humans, we would take such behavior as evidence of theory of mind. Should we attribute the same capacity to LLMs?”

“Skeptics argue that these models rely on ‘mere’ pattern recognition. However, one could counter that human intelligence itself is ‘just’ pattern recognition. Our skills and abilities do not emerge out of nowhere—they are rooted in the brain’s capacity to recognize and extrapolate from patterns in its ‘training data.’”

Future research could explore whether AI’s apparent theory of mind abilities extend to more complex scenarios involving multiple characters or conflicting beliefs. Researchers might also investigate how these abilities develop in AI systems as they are trained on increasingly diverse and sophisticated datasets. Importantly, understanding the mechanisms behind these emergent capabilities could inform both the development of safer AI and our understanding of human cognition.

“The rapid emergence of human-like abilities in AI raises profound questions about the potential for AI consciousness,” Kosinski said. “Will AI ever become conscious, and what might that look like?”

“And that is not even the most interesting question. Consciousness is unlikely to be the ultimate achievement for neural networks in our universe. We may soon find ourselves surrounded by AI systems possessing abilities that transcend human capacities. This prospect is both exhilarating and deeply unsettling. How do we control entities equipped with abilities we might not even begin to comprehend?”

“I believe psychology as a field is uniquely positioned to detect and explore the emergence of such non-human psychological processes,” Kosinski concluded. “By doing so, we can prepare for and adapt to this unprecedented shift in our understanding of intelligence.”

The study, “Evaluating large language models in theory of mind tasks,” was published October 29, 2024.
