Modern AI is often judged to be more human than actual humans in Turing test experiments

Recent research published in the Proceedings of the National Academy of Sciences shows that certain modern artificial intelligence systems can pass a standard Turing test. When instructed to adopt a specific human personality, these computer programs convinced human judges that they were real people more than half of the time. The finding is the first empirical evidence that a modern system can pass this major scientific benchmark, and it raises profound questions about the future of online communication.

To fully understand this research, it helps to know a bit about large language models (LLMs). These are highly complex computer programs trained on vast amounts of text data scraped from the internet. They power the popular AI chatbots that many people use today for writing emails, brainstorming ideas, and coding software.

Large language models learn the statistical patterns of human language to predict the next word in a sequence. This allows them to generate incredibly natural-sounding text in response to user questions.
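As a rough illustration of what "predicting the next word" means, here is a minimal sketch of a bigram model, which simply counts which word most often follows another. It is a toy stand-in, not how the models in the study work: real large language models use neural networks trained on far more text. The function names and the tiny corpus are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, how often every other word follows it."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus, made up for illustration only.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat" ("cat" follows "the" twice, "mat" once)
```

The same idea, scaled up from word-pair counts to a learned model over long contexts, is what lets a chatbot continue a conversation one word at a time.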

The researchers conducting this study, Cameron R. Jones and Benjamin K. Bergen, wanted to see how well these modern models could handle a classic evaluation known as the Turing test. Originally proposed by British mathematician Alan Turing in 1950, this theoretical game provides a way to evaluate whether a machine can imitate human conversation well enough to be entirely indistinguishable from a real person.

In a standard three-party version of the test, a human judge talks to two hidden participants at the same time through a text chat interface. One of the hidden participants is a real human, and the other is a computer program. If the human judge cannot reliably tell which participant is the machine, the computer is said to have passed the test.
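One way to make "cannot reliably guess" concrete is to compare how often the machine is picked as the human against the 50 percent expected from pure guessing. The sketch below does this with an exact two-sided binomial test. It assumes each game is an independent trial, and the counts in the usage lines are invented for illustration; they are not the study's data, and this is not necessarily the analysis the authors ran.

```python
from math import comb


def win_rate(times_judged_human, games):
    """Share of games in which the machine was picked as the human."""
    return times_judged_human / games


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed one under the chance rate `p`.
    """
    def pmf(k):
        return comb(trials, k) * p**k * (1 - p) ** (trials - k)

    observed = pmf(successes)
    # Small tolerance so outcomes tied with the observed one are included.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 40 games, machine judged human in 30 of them.
rate = win_rate(30, 40)
p_value = binomial_two_sided_p(30, 40)
print(f"win rate = {rate:.2f}, p-value vs. chance = {p_value:.4f}")
```

A win rate near 0.5 with a large p-value would mean judges are doing no better than guessing. A rate well above 0.5 with a small p-value would mean the machine is picked as the human more often than the real person is.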

Jones and Bergen initiated this study because standard evaluations for artificial intelligence tend to be static and narrow. High scores on typical benchmark tests might just reflect memorization of training data rather than actual conversational adaptability. A traditional Turing test provides a flexible, interactive environment where human judges can ask open-ended questions and probe the machine for perceived weaknesses.

“The Turing test started as a way to ask whether machines could rival human intelligence,” said Bergen, a professor of cognitive science at the University of California San Diego. “But now we know AI can answer many questions faster and more accurately than people can, so the real issue isn’t raw brainpower.”

“Seeing that machines can pass the test, and seeing how they pass it, forces us to rethink what it measures,” Bergen said. “Increasingly, it’s measuring humanlikeness.”

To explore these questions, the scientists set up a series of highly controlled online experiments involving nearly 500 total participants. They evaluated several different artificial intelligence systems, which included GPT-4.5, LLaMa-3.1-405B, and GPT-4o. The scientists also included an older, much simpler chatbot program from the 1960s called ELIZA as a baseline measure.

In the first study, the researchers recruited 126 undergraduate students from the University of California San Diego. In the second study, they recruited 158 participants from an online survey platform called Prolific. Human participants were randomly assigned to play the role of either the judge or the human conversationalist.

To run the study, the researchers built an online interface designed to feel like a familiar messaging app. “For the interrogator, they have a split screen on their computer and they’re asking questions to both witnesses,” said Jones, who conducted the research while earning his doctorate in cognitive science from the University of California San Diego and is now an assistant professor of psychology at Stony Brook University. “They know that one of those witnesses is a human and one of them is an AI.”

The judge had exactly five minutes to chat with both participants. After the time was up, the judge had to decide which conversational partner was the real human, rate their confidence, and explain their reasoning.

The researchers tested two different sets of instructions, known as prompts, for the artificial intelligence models. The first prompt simply told the computer it was taking part in a Turing test. The second, a persona prompt, gave the computer a detailed character to play, instructing it to act like an introverted young person who was highly familiar with internet culture and modern slang.

The persona prompt made a large difference in the results. GPT-4.5 was judged to be the human 73 percent of the time, meaning it tricked the judges significantly more often than the real human participants did. The LLaMa-3.1-405B model, given the same prompt, was judged human 56 percent of the time.

“What we found is that if given the right prompts, advanced LLMs can exhibit the same tone, directness, humor and fallibility as humans,” Jones said. “While we know LLMs can easily produce knowledge on nearly every topic, this test showed that it can also convincingly display social behavioral traits, which has major implications for how we think of AI.”

However, the models performed much worse without the persona prompt. Given only the basic instructions, GPT-4.5 fell to a 36 percent win rate, and LLaMa-3.1-405B dropped to 38 percent. The baseline systems performed poorly overall, with ELIZA selected as human only 23 percent of the time and GPT-4o chosen 21 percent of the time.

“They have the ability to appear human-like, but maybe not as much the ability to figure out what it would take to appear human-like,” Bergen said.

The researchers then conducted a replication study to see if a longer conversation would help human judges accurately spot the machines. They recruited 205 new participants from Prolific and extended the chat time limit to fifteen minutes. They used GPT-5 alongside the LLaMa-3.1-405B model and ELIZA.

Even with more time, the human judges still struggled to identify the machines that had been given a persona prompt. The GPT-5 model was judged to be human 59 percent of the time, and LLaMa-3.1-405B was chosen 56 percent of the time. The human judges easily identified the simple ELIZA program, guessing it was human only 17 percent of the time.

The researchers analyzed the conversational strategies the human judges used to figure out who was real. Most judges tried to make small talk, asking about daily activities, personal opinions, and emotional experiences. Judges tended to guess that a participant was human if the participant made minor typos, seemed to lack knowledge about a specific topic, or responded directly without sounding overly formal.

“These traits aren’t the kinds of math and logic problem-solving intelligence that I think Turing was imagining,” Bergen said.

Additionally, the scientists noticed that the university students performed slightly better than the online participants. This suggests the students may have shared more common ground, such as local campus details, that helped them probe each other more effectively.

The authors caution against misinterpreting what the results mean. Passing a Turing test does not mean that a machine possesses genuine human intelligence or consciousness. Instead, it suggests that the machine is exceptionally good at matching human expectations of how another person might chat online.

The study also has distinct limitations. The high success rates of the large language models depended entirely on the specific persona prompt provided by the researchers. Without these detailed instructions, the models failed to consistently trick the judges, showing that they still need human guidance to behave in convincingly human ways.

Future research could explore how different types of judges perform on this classic test. Scientists might test whether experts in computer science are better at spotting artificial intelligence than the general public. Researchers might also look into whether everyday humans can be trained to recognize machine-generated text over longer periods of time.

The findings carry real-world implications for trust online. “It’s relatively easy to prompt these models to be indistinguishable from humans,” Jones said. “We need to be more alert; when you interact with strangers online people should be much less confident that they know they’re talking to a human rather than an LLM.”

“The Turing test is a game about lying for the models,” Jones said. “One of the implications is that models seem to be really good at that.”

Being unable to discern whether you are interacting with a human or a bot can have serious consequences for everyday people. “There are lots of people who would like to use bots to persuade people to share their social security numbers, and vote for their party, or buy their product,” Bergen said.

The study, “Large language models pass a standard three-party Turing test,” was authored by Cameron R. Jones and Benjamin K. Bergen.
