PsyPost

Modern AI is often judged to be more human than actual humans in Turing test experiments

by Eric W. Dolan
May 21, 2026

Recent research published in the Proceedings of the National Academy of Sciences provides evidence that certain modern artificial intelligence systems can pass a standard Turing test. When instructed to adopt a specific human persona, these computer programs fooled human judges into thinking they were real people more than half of the time. This is the first empirical evidence that a modern system can pass this major benchmark, and it raises profound questions about the future of online communication.

To fully understand this research, it helps to know a bit about large language models (LLMs). These are highly complex computer programs trained on vast amounts of text data scraped from the internet. They power the popular AI chatbots that many people use today for writing emails, brainstorming ideas, and coding software.

Large language models learn the statistical patterns of human language to predict the next word in a sequence. This allows them to generate incredibly natural-sounding text in response to user questions.
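As a rough illustration of what "predicting the next word" means, here is a toy sketch in Python. It only counts which word most often follows another in a tiny made-up sentence; real large language models use neural networks over far more context, and the corpus and function names here are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each word follows each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text, for illustration only.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # prints: cat
```

In this toy corpus, "cat" follows "the" twice and "mat" only once, so "cat" is the prediction.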

The researchers conducting this study, Cameron R. Jones and Benjamin K. Bergen, wanted to see how well these modern models could handle a classic evaluation known as the Turing test. Originally proposed by British mathematician Alan Turing in 1950, this theoretical game provides a way to evaluate whether a machine can imitate human conversation well enough to be entirely indistinguishable from a real person.

In a standard three-party version of the test, a human judge talks to two hidden participants at the same time through a text chat interface. One of the hidden participants is a real human, and the other is a computer program. If the judge cannot reliably tell which participant is the machine, the computer is said to have passed the test.
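One way to make "cannot reliably tell" concrete is to compare how often judges pick the machine as the human against the 50 percent expected from pure guessing. The Python sketch below runs an exact two-sided binomial test for that comparison. The counts are hypothetical and not taken from the study, the function name is invented, and the study's own statistical analysis may differ.

```python
from math import comb

def binom_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the chance rate `p`.
    """
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small tolerance so ties are not lost to floating-point rounding.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: the machine is picked as the human in 73 of 100 chats.
p_value = binom_two_sided_p(73, 100)
print(f"p-value against 50% guessing: {p_value:.2e}")
```

A small p-value would suggest judges are not guessing at random. Whether they are leaning toward the machine or toward the human is read from the rate itself.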

Jones and Bergen initiated this study because standard evaluations for artificial intelligence tend to be static and narrow. High scores on typical benchmark tests might just reflect memorization of training data rather than actual conversational adaptability. A traditional Turing test provides a flexible, interactive environment where human judges can ask open-ended questions and probe the machine for perceived weaknesses.

“The Turing test started as a way to ask whether machines could rival human intelligence,” said Bergen, a professor of cognitive science at the University of California San Diego. “But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn’t raw brainpower.”

“Seeing that machines can pass the test, and seeing how they pass it, forces us to rethink what it measures,” Bergen said. “Increasingly, it’s measuring humanlikeness.”

To explore these questions, the scientists set up a series of highly controlled online experiments involving nearly 500 total participants. They evaluated several different artificial intelligence systems, which included GPT-4.5, LLaMa-3.1-405B, and GPT-4o. The scientists also included an older, much simpler chatbot program from the 1960s called ELIZA as a baseline measure.

In the first study, the researchers recruited 126 undergraduate students from the University of California San Diego. In the second study, they recruited 158 participants from an online survey platform called Prolific. Human participants were randomly assigned to play the role of either the judge or the human conversationalist.

To run the study, the researchers built an online interface designed to feel like a familiar messaging app. “For the interrogator, they have a split screen on their computer and they’re asking questions to both witnesses,” said Jones, who conducted the research while earning his doctorate in cognitive science from the University of California San Diego and is now an assistant professor of psychology at Stony Brook University. “They know that one of those witnesses is a human and one of them is an AI.”

The judge had exactly five minutes to chat with both participants. After the time was up, the judge had to decide which conversational partner was the real human, rate their confidence, and explain their reasoning.

The researchers tested two different sets of instructions, known as prompts, for the artificial intelligence models. The first prompt simply told the computer it was taking part in a Turing test. The second, a persona prompt, gave the computer a detailed character to play, instructing it to act like an introverted young person who was highly familiar with internet culture and modern slang.

The persona prompt made a large difference in the results. GPT-4.5 was judged to be the human 73 percent of the time, meaning judges chose it over the real human participant in roughly three out of four conversations, significantly more often than they chose the real human. The LLaMa-3.1-405B model, given the same prompt, was judged human 56 percent of the time.

“What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans,” Jones said. “While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI.”

However, the models performed much worse without the persona prompt. Given only the first, minimal prompt, GPT-4.5 fell to a 36 percent win rate, and LLaMa-3.1-405B dropped to 38 percent. The older baseline systems performed poorly overall, with ELIZA selected as human only 23 percent of the time and GPT-4o chosen 21 percent of the time.

“They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like,” Bergen said.

The researchers then conducted a replication study to see if a longer conversation would help human judges accurately spot the machines. They recruited 205 new participants from Prolific and extended the chat time limit to fifteen minutes. They used GPT-5 alongside the LLaMa-3.1-405B model and ELIZA.

Even with more time, the human judges still struggled to identify the machines using a persona prompt. The GPT-5 model was judged to be human 59 percent of the time, and LLaMa-3.1-405B was chosen 56 percent of the time. The human judges easily identified the simple ELIZA program, guessing it was human only 17 percent of the time.

The researchers analyzed the conversational strategies the human judges used to figure out who was real. Most judges made small talk, asking about daily activities, personal opinions, and emotional experiences. Judges tended to guess that a participant was human if that participant made minor typos, seemed to lack knowledge about a specific topic, or responded directly without sounding overly formal.

“These traits aren’t the kinds of math and logic problem-solving intelligence that I think Turing was imagining,” Bergen said.

Additionally, the scientists noticed that the university students performed slightly better than the online participants. This suggests the students may have shared more common ground, such as local campus details, that helped them probe each other more effectively.

The authors caution against misinterpreting what the results mean. Passing a Turing test does not mean that a machine possesses genuine human intelligence or consciousness. Instead, it suggests that the machine is exceptionally good at matching human expectations of how another person might chat online.

The study also has distinct limitations. The high success rates of the large language models depended entirely on the specific persona prompt provided by the researchers. Without these detailed instructions, the models failed to consistently trick the judges, showing that they still need human guidance to behave in convincingly human ways.

Future research could explore how different types of judges perform on this classic test. Scientists might test whether experts in computer science are better at spotting artificial intelligence than the general public. Researchers might also look into whether everyday humans can be trained to recognize machine-generated text over longer periods of time.

The findings carry real-world implications for trust online. “It’s relatively easy to prompt these models to be indistinguishable from humans,” Jones said. “We need to be more alert; when you interact with strangers online people should be much less confident that they know they’re talking to a human rather than an LLM.”

“The Turing test is a game about lying for the models,” Jones said. “One of the implications is that models seem to be really good at that.”

Being unable to discern whether you are interacting with a human or a bot can have serious consequences for everyday people. “There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product,” Bergen said.

The study, “Large language models pass a standard three-party Turing test,” was authored by Cameron R. Jones and Benjamin K. Bergen.

(c) PsyPost Media Inc
