PsyPost

Modern AI is often judged to be more human than actual humans in Turing test experiments

by Eric W. Dolan
May 21, 2026

Recent research published in the Proceedings of the National Academy of Sciences provides evidence that certain modern artificial intelligence systems can pass a standard Turing test. When instructed to adopt a specific human personality, these computer programs fooled human judges into thinking they were real people more than half of the time. The finding is the first empirical evidence that a modern system can pass this major scientific benchmark, and it raises profound questions about the future of online communication.

To fully understand this research, it helps to know a bit about large language models (LLMs). These are highly complex computer programs trained on vast amounts of text data scraped from the internet. They power the popular AI chatbots that many people use today for writing emails, brainstorming ideas, and coding software.

Large language models learn the statistical patterns of human language to predict the next word in a sequence. This allows them to generate incredibly natural-sounding text in response to user questions.

The researchers conducting this study, Cameron R. Jones and Benjamin K. Bergen, wanted to see how well these modern models could handle a classic evaluation known as the Turing test. Originally proposed by British mathematician Alan Turing in 1950, this theoretical game provides a way to evaluate whether a machine can imitate human conversation well enough to be entirely indistinguishable from a real person.

In a standard three-party version of the test, a human judge talks to two hidden participants at the same time using a text chat interface. One of those hidden participants is a real human, and the other is a computer program. If the human judge cannot reliably guess which participant is the machine, the computer is said to have passed the test.
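To make "cannot reliably guess" concrete, one common way to frame it is to ask whether the machine's rate of being picked as the human differs from the 50 percent expected by chance. The sketch below is only an illustration of that idea using an exact two-sided binomial test. The function names and the counts in the usage lines are hypothetical and are not taken from the study, whose actual statistical analysis may differ.

```python
from math import comb


def two_sided_binomial_p(wins: int, games: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `wins` out of `games`.

    Sums the probability of every outcome that is no more likely than
    the observed one under the chance rate. Hypothetical helper.
    """
    def pmf(k: int) -> float:
        return comb(games, k) * chance**k * (1 - chance) ** (games - k)

    observed = pmf(wins)
    # Small relative tolerance so outcomes tied with the observed one are counted.
    total = sum(pmf(k) for k in range(games + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


def classify(wins: int, games: int, alpha: float = 0.05) -> str:
    """Label a win rate as above, below, or at chance (hypothetical helper)."""
    if two_sided_binomial_p(wins, games) >= alpha:
        return "at chance"
    return "above chance" if wins / games > 0.5 else "below chance"


# Illustrative counts only, NOT the study's data.
print(classify(73, 100))  # machine picked as human in 73 of 100 games
print(classify(50, 100))  # machine picked as human in 50 of 100 games
print(classify(21, 100))  # machine picked as human in 21 of 100 games
```

Under this framing, a machine whose result is "at chance" or "above chance" is one the judges could not reliably tell apart from a person, while "below chance" means the judges usually spotted it.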

Jones and Bergen initiated this study because standard evaluations for artificial intelligence tend to be static and narrow. High scores on typical benchmark tests might just reflect memorization of training data rather than actual conversational adaptability. A traditional Turing test provides a flexible, interactive environment where human judges can ask open-ended questions and probe the machine for perceived weaknesses.

“The Turing test started as a way to ask whether machines could rival human intelligence,” said Bergen, a professor of cognitive science at the University of California San Diego. “But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn’t raw brainpower.”

“Seeing that machines can pass the test, and seeing how they pass it, forces us to rethink what it measures,” Bergen said. “Increasingly, it’s measuring humanlikeness.”

To explore these questions, the scientists set up a series of highly controlled online experiments involving nearly 500 total participants. They evaluated several different artificial intelligence systems, which included GPT-4.5, LLaMa-3.1-405B, and GPT-4o. The scientists also included an older, much simpler chatbot program from the 1960s called ELIZA as a baseline measure.

In the first study, the researchers recruited 126 undergraduate students from the University of California San Diego. In the second study, they recruited 158 participants from an online survey platform called Prolific. Human participants were randomly assigned to play the role of either the judge or the human conversationalist.

To run the study, the researchers built an online interface designed to feel like a familiar messaging app. “For the interrogator, they have a split screen on their computer and they’re asking questions to both witnesses,” said Jones, who conducted the research while earning his doctorate in cognitive science from the University of California San Diego and is now an assistant professor of psychology at Stony Brook University. “They know that one of those witnesses is a human and one of them is an AI.”

The judge had exactly five minutes to chat with both participants. After the time was up, the judge had to decide which conversational partner was the real human, rate their confidence, and explain their reasoning.

The researchers tested two different sets of instructions, known as prompts, for the artificial intelligence models. The first prompt simply told the computer it was taking part in a Turing test. The second, a persona prompt, gave the computer a detailed character to play, instructing it to act like an introverted young person who was highly familiar with internet culture and modern slang.

The persona prompt made a large difference in the results. GPT-4.5 was judged to be the human 73 percent of the time, meaning judges picked it over the real human participant in nearly three out of four conversations. The LLaMa-3.1-405B model, given the same prompt, was judged human 56 percent of the time.

“What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans,” Jones said. “While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI.”

However, the models performed much worse without the specific persona prompt. Without explicit instructions, GPT-4.5 fell to a 36 percent win rate, and LLaMa-3.1-405B dropped to 38 percent. The older baseline systems performed poorly overall, with ELIZA selected as human only 23 percent of the time and GPT-4o chosen 21 percent of the time.

“They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like,” Bergen said.

The researchers then conducted a replication study to see if a longer conversation would help human judges accurately spot the machines. They recruited 205 new participants from Prolific and extended the chat time limit to fifteen minutes. They used GPT-5 alongside the LLaMa-3.1-405B model and ELIZA.

Even with more time, the human judges still struggled to identify the machines using a persona prompt. The GPT-5 model was judged to be human 59 percent of the time, and LLaMa-3.1-405B was chosen 56 percent of the time. The human judges easily identified the simple ELIZA program, guessing it was human only 17 percent of the time.

The researchers analyzed the specific conversational strategies the human judges used to figure out who was real. Most judges tried to make small talk, asking about daily activities, personal opinions, and emotional experiences. Judges tended to guess that a participant was human if the participant made minor typos, seemed to lack knowledge about a specific topic, or responded directly without sounding overly formal.

“These traits aren’t the kinds of math and logic problem-solving intelligence that I think Turing was imagining,” Bergen said.

Additionally, the scientists noticed that the university students performed slightly better than the online participants. This suggests the students may have shared more common ground, such as local campus details, that helped them probe each other more effectively.

The authors caution against misinterpreting what the results mean. Passing a Turing test does not mean that a machine possesses genuine human intelligence or consciousness. Instead, it suggests that the machine is exceptionally good at matching human expectations of how another person might chat online.

The study also has distinct limitations. The high success rates of the large language models depended entirely on the specific persona prompt provided by the researchers. Without these detailed instructions, the models failed to consistently trick the judges, showing that they still need human guidance to behave in convincingly human ways.

Future research could explore how different types of judges perform on this classic test. Scientists might test whether experts in computer science are better at spotting artificial intelligence than the general public. Researchers might also look into whether everyday humans can be trained to recognize machine-generated text over longer periods of time.

The findings carry real-world implications for trust online. “It’s relatively easy to prompt these models to be indistinguishable from humans,” Jones said. “We need to be more alert; when you interact with strangers online people should be much less confident that they know they’re talking to a human rather than an LLM.”

“The Turing test is a game about lying for the models,” Jones said. “One of the implications is that models seem to be really good at that.”

Being unable to discern whether you are interacting with a human or a bot can have serious consequences for everyday people. “There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product,” Bergen said.

The study, “Large language models pass a standard three-party Turing test,” was authored by Cameron R. Jones and Benjamin K. Bergen.

(c) PsyPost Media Inc
