PsyPost

Modern AI is often judged to be more human than actual humans in Turing test experiments

by Eric W. Dolan
May 21, 2026

Recent research published in the Proceedings of the National Academy of Sciences provides evidence that certain modern artificial intelligence systems can pass a standard Turing test. When instructed to adopt a specific human personality, these computer programs fooled human judges into thinking they were real people more than half of the time. The finding is the first empirical evidence that a modern system can pass this long-standing benchmark, and it raises questions about the future of online communication.

To fully understand this research, it helps to know a bit about large language models (LLMs). These are highly complex computer programs trained on vast amounts of text data scraped from the internet. They power the popular AI chatbots that many people use today for writing emails, brainstorming ideas, and coding software.

Large language models learn the statistical patterns of human language to predict the next word in a sequence. This allows them to generate incredibly natural-sounding text in response to user questions.
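As a rough illustration of next-word prediction, the toy sketch below counts which word follows which in a tiny sample sentence and then predicts the most frequent follower. This is a deliberate simplification: real large language models use neural networks trained on far more text, and the function names and the sample sentence here are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text: str) -> dict:
    """Count, for every word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: dict, word: str):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_counts(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "on"))
```

A model like this only knows about pairs of adjacent words, which is why it cannot hold a conversation. Modern systems condition on much longer stretches of preceding text.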

The researchers conducting this study, Cameron R. Jones and Benjamin K. Bergen, wanted to see how well these modern models could handle a classic evaluation known as the Turing test. Originally proposed by British mathematician Alan Turing in 1950, this theoretical game provides a way to evaluate whether a machine can imitate human conversation well enough to be entirely indistinguishable from a real person.

In a standard three-party version of the test, a human judge talks to two hidden participants at the exact same time using a text chat interface. One of those hidden participants is a real human, and the other is a computer program. If the human judge cannot reliably guess which participant is the machine, the computer is said to have successfully passed the test.

Jones and Bergen initiated this study because standard evaluations for artificial intelligence tend to be static and narrow. High scores on typical benchmark tests might just reflect memorization of training data rather than actual conversational adaptability. A traditional Turing test provides a flexible, interactive environment where human judges can ask open-ended questions and probe the machine for perceived weaknesses.

“The Turing test started as a way to ask whether machines could rival human intelligence,” said Bergen, a professor of cognitive science at the University of California San Diego. “But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn’t raw brainpower.”

“Seeing that machines can pass the test, and seeing how they pass it, forces us to rethink what it measures,” Bergen said. “Increasingly, it’s measuring humanlikeness.”


To explore these questions, the scientists set up a series of highly controlled online experiments involving nearly 500 total participants. They evaluated several different artificial intelligence systems, which included GPT-4.5, LLaMa-3.1-405B, and GPT-4o. The scientists also included an older, much simpler chatbot program from the 1960s called ELIZA as a baseline measure.

In the first study, the researchers recruited 126 undergraduate students from the University of California San Diego. In the second study, they recruited 158 participants from an online survey platform called Prolific. Human participants were randomly assigned to play the role of either the judge or the human conversationalist.

To run the study, the researchers built an online interface designed to feel like a familiar messaging app. “For the interrogator, they have a split screen on their computer and they’re asking questions to both witnesses,” said Jones, who conducted the research while earning his doctorate in cognitive science from the University of California San Diego and is now an assistant professor of psychology at Stony Brook University. “They know that one of those witnesses is a human and one of them is an AI.”

The judge had exactly five minutes to chat with both participants. After the time was up, the judge had to decide which conversational partner was the real human, rate their confidence, and explain their reasoning.

The researchers tested two different sets of instructions, known as prompts, for the artificial intelligence models. The first prompt simply told the computer it was taking part in a Turing test. The second, a persona prompt, gave the computer a detailed character to play, instructing it to act like an introverted young person who was highly familiar with internet culture and modern slang.

The persona prompt made a large difference in the results. With it, GPT-4.5 was judged to be the human 73 percent of the time. Because each judge had to pick one of the two witnesses as the human, this means judges chose GPT-4.5 significantly more often than they chose the real human participant it was paired with. The LLaMa-3.1-405B model, given the same prompt, was judged human 56 percent of the time.
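One way to reason about whether a rate such as 73 percent or 56 percent differs from the 50 percent expected from pure guessing is an exact binomial test. The sketch below is not the authors' analysis. The article does not report how many games each model played, so the figure of 100 games per model is a hypothetical assumption, as is the function name.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, chance: float = 0.5) -> float:
    """Two-sided exact binomial p-value, computed by doubling the smaller tail."""

    def pmf(k: int) -> float:
        return comb(trials, k) * chance**k * (1 - chance) ** (trials - k)

    lower_tail = sum(pmf(k) for k in range(0, successes + 1))
    upper_tail = sum(pmf(k) for k in range(successes, trials + 1))
    return min(1.0, 2 * min(lower_tail, upper_tail))


# HYPOTHETICAL: 100 games per model; only the percentages come from the article.
games = 100
for name, judged_human in [("GPT-4.5", 73), ("LLaMa-3.1-405B", 56)]:
    p_value = binomial_two_sided_p(judged_human, games)
    print(f"{name}: {judged_human}/{games} judged human, p = {p_value:.4g}")
```

Under that assumed sample size, a 73 percent rate would be very unlikely to arise from guessing, while a 56 percent rate would be consistent with it. The real conclusion depends on the actual number of games, which should be taken from the published study.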

“What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans,” Jones said. “While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI.”

However, the models performed much worse without the persona prompt. Told only that they were taking part in a Turing test, GPT-4.5 was judged human 36 percent of the time and LLaMa-3.1-405B 38 percent of the time. The baseline systems performed poorly overall, with ELIZA selected as human only 23 percent of the time and GPT-4o chosen 21 percent of the time.

“They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like,” Bergen said.

The researchers then conducted a replication study to see if a longer conversation would help human judges accurately spot the machines. They recruited 205 new participants from Prolific and extended the chat time limit to fifteen minutes. They used GPT-5 alongside the LLaMa-3.1-405B model and ELIZA.

Even with more time, the human judges still struggled to identify machines that had been given a persona prompt. The GPT-5 model was judged to be human 59 percent of the time, and LLaMa-3.1-405B was chosen 56 percent of the time. The human judges easily identified the simple ELIZA program, guessing it was human only 17 percent of the time.

The researchers analyzed the specific conversational strategies the human judges used to figure out who was real. Most judges tried to make small talk, asking about daily activities, personal opinions, and emotional experiences. Judges tended to guess that a participant was human if they made minor spelling typos, seemed to lack knowledge about a specific topic, or responded directly without sounding overly formal.

“These traits aren’t the kinds of math and logic problem-solving intelligence that I think Turing was imagining,” Bergen said.

Additionally, the scientists noticed that the university students performed slightly better than the online participants. This suggests the students may have shared more common ground, such as local campus details, that helped them probe each other more effectively.

The authors caution against misinterpreting what the results mean. Passing a Turing test does not mean that a machine possesses genuine human intelligence or consciousness. Instead, it suggests that the machine is exceptionally good at matching human expectations of how another person might chat online.

The study also has distinct limitations. The high success rates of the large language models depended entirely on the specific persona prompt provided by the researchers. Without these detailed instructions, the models failed to consistently trick the judges, showing that they still need human guidance to behave in convincingly human ways.

Future research could explore how different types of judges perform on this classic test. Scientists might test whether experts in computer science are better at spotting artificial intelligence than the general public. Researchers might also look into whether everyday humans can be trained to recognize machine-generated text over longer periods of time.

The findings carry real-world implications for trust online. “It’s relatively easy to prompt these models to be indistinguishable from humans,” Jones said. “We need to be more alert; when you interact with strangers online people should be much less confident that they know they’re talking to a human rather than an LLM.”

“The Turing test is a game about lying for the models,” Jones said. “One of the implications is that models seem to be really good at that.”

Being unable to discern whether you are interacting with a human or a bot can have serious consequences for everyday people. “There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product,” Bergen said.

The study, “Large language models pass a standard three-party Turing test,” was authored by Cameron R. Jones and Benjamin K. Bergen.

PsyPost is a psychology and neuroscience news website dedicated to reporting the latest research on human behavior, cognition, and society.

(c) PsyPost Media Inc
