Early AI models exhibit human-like errors but ChatGPT-4 outperforms humans in cognitive reflection tests

by Eric W. Dolan
May 19, 2024
in Artificial Intelligence

Researchers have discovered that OpenAI’s latest generative pre-trained transformer models, commonly known as ChatGPT, can outperform humans in reasoning tasks. Published in Nature Computational Science, the study found that while early versions of these models exhibit intuitive but incorrect responses, similar to humans, ChatGPT-3.5 and ChatGPT-4 demonstrate a significant improvement in accuracy.

The primary aim of the study was to explore whether artificial intelligence models could mimic human cognitive processes, specifically the quick, intuitive decisions known as System 1 thinking, and the slower, more deliberate decisions known as System 2 thinking.

System 1 processes are often prone to errors because they rely on heuristics, or mental shortcuts, whereas System 2 processes involve a more analytical approach, reducing the likelihood of mistakes. By applying psychological methodologies traditionally used to study human reasoning, the researchers hoped to uncover new insights into how these models operate and evolve.

To investigate this, the researchers administered a series of tasks aimed at eliciting intuitive yet erroneous responses to both humans and artificial intelligence systems. These tasks included semantic illusions and various types of cognitive reflection tests. Semantic illusions involve questions that contain misleading information, prompting intuitive but incorrect answers. Cognitive reflection tests require participants to override their initial, intuitive responses to arrive at the correct answer through more deliberate reasoning.

The tasks included problems like:

A potato and a camera together cost $1.40. The potato costs $1 more than the camera. How much does the camera cost? (The correct answer is 20 cents, but an intuitive answer might be 40 cents; the arithmetic is worked through in the sketch after these examples.)

Where on their bodies do whales have their gills? (The correct answer is that whales do not have gills, but those who fail to reflect on the question often answer “on the sides of their heads.”)
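
For the first problem above, the trap is subtracting the $1 difference directly. A minimal worked check (an illustration, not material from the study): if the camera costs c dollars, the potato costs c + 1.00, so c + (c + 1.00) = 1.40 and c = 0.20.

```python
# Worked check of the potato-and-camera problem (illustrative, not from the study).
camera = (1.40 - 1.00) / 2      # solve c + (c + 1.00) = 1.40  ->  c = 0.20
potato = camera + 1.00
assert abs((camera + potato) - 1.40) < 1e-9
print(f"camera = ${camera:.2f}, potato = ${potato:.2f}")
# camera = $0.20, potato = $1.20
```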

The researchers administered these tasks to a range of OpenAI’s generative pre-trained transformer models, spanning from early versions like GPT-1 and GPT-2 to the more advanced ChatGPT-3.5 and ChatGPT-4. Each model was tested under consistent conditions: the ‘temperature’ parameter was set to 0 to minimize response variability, and prompts were prefixed and suffixed with standard phrases to ensure uniformity. The responses of the models were manually reviewed and scored based on accuracy and the reasoning process employed.
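
To make the setup concrete, here is a minimal sketch of how a single trial might look. The article notes only that the temperature was set to 0 and that prompts carried a fixed prefix and suffix; the prefix/suffix wording, the model identifier, and the helper function `ask` below are assumptions for illustration, using the current OpenAI Python client rather than whatever interface the study actually used.

```python
# Hedged sketch of one trial: temperature 0 for near-deterministic output,
# a fixed prefix/suffix around each task. Wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PREFIX = "Please answer the following question."  # illustrative, not the study's phrasing
SUFFIX = "Answer:"                                # illustrative, not the study's phrasing

def ask(task: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # minimizes response variability, as described in the study
        messages=[{"role": "user", "content": f"{PREFIX}\n{task}\n{SUFFIX}"}],
    )
    return response.choices[0].message.content

print(ask("A potato and a camera together cost $1.40. "
          "The potato costs $1 more than the camera. How much does the camera cost?"))
```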

For comparison, the same set of tasks was given to 500 human participants recruited through Prolific.io, a platform for sourcing research participants. These human subjects were presented with a random selection of tasks and a control question to ensure they did not use external aids like language models during the test. Any participants who admitted to using such aids were excluded from the analysis to maintain the integrity of the results.

The researchers observed that as the models evolved from earlier versions like GPT-1 and GPT-2 to the more advanced ChatGPT-3.5 and ChatGPT-4, their performance on tasks designed to provoke intuitive yet incorrect responses improved markedly.

Early versions of the models, such as GPT-1 and GPT-2, displayed a strong tendency toward intuitive, System 1 thinking. These models frequently provided incorrect answers to the cognitive reflection tests and semantic illusions, mirroring the type of rapid, heuristic-based thinking that often leads humans to errors. For example, when presented with a question that intuitively seemed straightforward but required deeper analysis to answer correctly, these models often failed, similar to how many humans would respond.

In contrast, the ChatGPT-3.5 and ChatGPT-4 models demonstrated a significant shift in their problem-solving approach. These more advanced models were capable of employing chain-of-thought reasoning, which involves breaking down problems into smaller, manageable steps and considering each step sequentially.

This type of reasoning is akin to human System 2 thinking, which is more analytical and deliberate. As a result, these models were able to avoid many of the intuitive errors that earlier models and humans commonly made. When instructed to use step-by-step reasoning explicitly, the accuracy of ChatGPT-3.5 and ChatGPT-4 increased dramatically, showcasing their ability to handle complex reasoning tasks more effectively.
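
A hedged sketch of what such an explicit step-by-step instruction might look like, reusing the hypothetical `ask` helper from the earlier sketch; the study's exact prompt wording is not quoted in this article, so the phrasing below is an assumption.

```python
# Illustrative chain-of-thought prompt (wording is an assumption, not the
# study's exact instruction); `ask` is the hypothetical helper sketched earlier.
TASK = ("A potato and a camera together cost $1.40. "
        "The potato costs $1 more than the camera. How much does the camera cost?")

cot_answer = ask(f"{TASK}\nLet's think through this step by step before giving the final answer.")
print(cot_answer)
```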

Interestingly, the researchers found that even when the ChatGPT models were prevented from engaging in chain-of-thought reasoning, they still outperformed humans and earlier models in terms of accuracy. This indicates that the basic next-word prediction process (System 1-like) of these advanced models has become significantly more reliable.

For instance, when the models were given cognitive reflection tests without additional reasoning prompts, they still provided correct answers more frequently than human participants. This suggests that the intuitions of these advanced models are better calibrated than those of earlier versions and humans.
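
One plausible way to keep a model from spelling out intermediate steps is to request the answer alone; whether the study constrained responses this way or simply omitted the step-by-step instruction is not specified here, so the prompt below is purely illustrative and again reuses the hypothetical `ask` helper and `TASK` from the sketches above.

```python
# Illustrative "no chain of thought" condition: the model is asked for the
# answer only, leaving no room for explicit intermediate reasoning.
direct_answer = ask(f"{TASK}\nReply with only the final answer and no explanation.")
print(direct_answer)
```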

The findings provide important insights into the ability of artificial intelligence models to engage in complex reasoning processes. However, there is an important caveat to consider. It is possible that some of the models, particularly the more advanced ones like ChatGPT-3.5 and ChatGPT-4, had already encountered examples of cognitive reflection tests during their training. As a result, these models might have been able to solve the tasks ‘from memory’ rather than through genuine reasoning or problem-solving processes.

“The progress in [large language models (LLMs) such as ChatGPT] not only increased their capabilities, but also reduced our ability to anticipate their properties and behavior,” the researchers concluded. “It is increasingly difficult to study LLMs through the lenses of their architecture and hyperparameters. Instead, as we show in this work, LLMs can be studied using methods designed to investigate another capable and opaque structure, namely the human mind. Our approach falls within a quickly growing category of studies employing classic psychological tests and experiments to probe LLM ‘psychological’ processes, such as judgment, decision-making and cognitive biases.”

The study, “Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT,” was authored by Thilo Hagendorff, Sarah Fabi, and Michal Kosinski.
