PsyPost

Early AI models exhibit human-like errors but ChatGPT-4 outperforms humans in cognitive reflection tests

by Eric W. Dolan
May 19, 2024

Researchers have discovered that OpenAI’s latest generative pre-trained transformer models, commonly known as ChatGPT, can outperform humans in reasoning tasks. Published in Nature Computational Science, the study found that while early versions of these models exhibit intuitive but incorrect responses, similar to humans, ChatGPT-3.5 and ChatGPT-4 demonstrate a significant improvement in accuracy.

The primary aim of the study was to explore whether artificial intelligence models could mimic human cognitive processes, specifically the quick, intuitive decisions known as System 1 thinking, and the slower, more deliberate decisions known as System 2 thinking.

System 1 processes are often prone to errors because they rely on heuristics, or mental shortcuts, whereas System 2 processes involve a more analytical approach, reducing the likelihood of mistakes. By applying psychological methodologies traditionally used to study human reasoning, the researchers hoped to uncover new insights into how these models operate and evolve.

To investigate this, the researchers gave both humans and artificial intelligence systems a series of tasks designed to elicit intuitive yet erroneous responses. These tasks included semantic illusions and various types of cognitive reflection tests. Semantic illusions are questions that contain misleading information, prompting intuitive but incorrect answers. Cognitive reflection tests require participants to override their initial, intuitive responses and reach the correct answer through more deliberate reasoning.

The tasks included problems like:

A potato and a camera together cost $1.40. The potato costs $1 more than the camera. How much does the camera cost? (The correct answer is 20 cents, but an intuitive answer might be 40 cents.)

Where on their bodies do whales have their gills? (The correct answer is that whales do not have gills, but those who fail to reflect on the question often answer “on the sides of their heads.”)
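The arithmetic behind the first problem can be checked with a short sketch. The function name and the use of integer cents are illustrative choices, not something taken from the study; the code only shows why 20 cents satisfies both conditions while the intuitive 40 cents does not.

```python
def solve_price_puzzle(total_cents, difference_cents):
    """Return (cheaper, dearer) prices in cents, given their sum and difference.

    If the cheaper item costs x, the dearer one costs x + difference,
    so x + (x + difference) = total, which gives x = (total - difference) / 2.
    """
    cheaper = (total_cents - difference_cents) // 2
    dearer = cheaper + difference_cents
    return cheaper, dearer


total = 140        # potato and camera together cost $1.40
difference = 100   # the potato costs $1 more than the camera

camera, potato = solve_price_puzzle(total, difference)

# The intuitive shortcut subtracts the difference from the total and stops there.
intuitive_camera = total - difference
intuitive_total = intuitive_camera + (intuitive_camera + difference)

print(f"camera: {camera} cents, potato: {potato} cents")
print(f"intuitive answer {intuitive_camera} cents implies a total of {intuitive_total} cents")
```

With the intuitive answer, the two prices add up to $1.80 rather than $1.40, which is the inconsistency that deliberate reasoning is supposed to catch.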

The researchers administered these tasks to a range of OpenAI’s generative pre-trained transformer models, spanning from early versions like GPT-1 and GPT-2 to the more advanced ChatGPT-3.5 and ChatGPT-4. Each model was tested under consistent conditions: the ‘temperature’ parameter was set to 0 to minimize response variability, and prompts were prefixed and suffixed with standard phrases to ensure uniformity. The responses of the models were manually reviewed and scored based on accuracy and the reasoning process employed.
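The consistent test conditions described above can be pictured with a minimal sketch. The article does not quote the actual prefix and suffix phrases or say how requests were sent, so the strings, the `QueryConfig` class, and the request dictionary layout below are all hypothetical placeholders; no real model API is called.

```python
from dataclasses import dataclass

# Hypothetical wrapper phrases; the article does not quote the ones used in the study.
PROMPT_PREFIX = "Please answer the following question.\n"
PROMPT_SUFFIX = "\nAnswer:"


@dataclass(frozen=True)
class QueryConfig:
    # A temperature of 0 minimizes response variability, as the article describes.
    temperature: float = 0.0


def build_prompt(task_text: str) -> str:
    """Wrap a task in the same prefix and suffix so every model sees a uniform prompt."""
    return f"{PROMPT_PREFIX}{task_text.strip()}{PROMPT_SUFFIX}"


def build_request(model_name: str, task_text: str, config: QueryConfig = QueryConfig()) -> dict:
    """Assemble a request description (hypothetical layout) without sending it anywhere."""
    return {
        "model": model_name,
        "prompt": build_prompt(task_text),
        "temperature": config.temperature,
    }


request = build_request("example-model", "Where on their bodies do whales have their gills?")
print(request["prompt"])
```

Keeping the wrapper text and temperature fixed means that differences between models' answers are less likely to come from prompt wording or sampling noise.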

For comparison, the same set of tasks was given to 500 human participants recruited through Prolific.io, a platform for sourcing research participants. These human subjects were presented with a random selection of tasks and a control question to ensure they did not use external aids like language models during the test. Any participants who admitted to using such aids were excluded from the analysis to maintain the integrity of the results.
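The exclusion step can be sketched as a simple filter. The records and the `used_external_aid` field are invented for illustration and are not data or variable names from the study.

```python
# Illustrative records only; these are not data from the study.
participants = [
    {"id": "p1", "used_external_aid": False},
    {"id": "p2", "used_external_aid": True},
    {"id": "p3", "used_external_aid": False},
]


def exclude_aided(records):
    """Keep only participants who did not report using external aids."""
    return [r for r in records if not r["used_external_aid"]]


retained = exclude_aided(participants)
print([r["id"] for r in retained])
```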

The researchers observed that as the models evolved from earlier versions like GPT-1 and GPT-2 to the more advanced ChatGPT-3.5 and ChatGPT-4, their performance on tasks designed to provoke intuitive yet incorrect responses improved markedly.

Early versions of the models, such as GPT-1 and GPT-2, displayed a strong tendency toward intuitive, System 1 thinking. These models frequently gave incorrect answers to the cognitive reflection tests and semantic illusions, mirroring the rapid, heuristic-based thinking that often leads humans into error. For example, when presented with a question that seemed straightforward but required deeper analysis to answer correctly, these models often failed, much as many humans do.

In contrast, the ChatGPT-3.5 and ChatGPT-4 models demonstrated a significant shift in their problem-solving approach. These more advanced models were capable of employing chain-of-thought reasoning, which involves breaking down problems into smaller, manageable steps and considering each step sequentially.

This type of reasoning is akin to human System 2 thinking, which is more analytical and deliberate. As a result, these models were able to avoid many of the intuitive errors that earlier models and humans commonly made. When instructed to use step-by-step reasoning explicitly, the accuracy of ChatGPT-3.5 and ChatGPT-4 increased dramatically, showcasing their ability to handle complex reasoning tasks more effectively.

Interestingly, the researchers found that even when the ChatGPT models were prevented from engaging in chain-of-thought reasoning, they still outperformed humans and earlier models in terms of accuracy. This indicates that the basic next-word prediction process (System 1-like) of these advanced models has become significantly more reliable.

For instance, when the models were given cognitive reflection tests without additional reasoning prompts, they still provided correct answers more frequently than human participants. This suggests that the intuitions of these advanced models are better calibrated than those of earlier versions and humans.

The findings provide important insights into the ability of artificial intelligence models to engage in complex reasoning processes. However, there is an important caveat to consider. It is possible that some of the models, particularly the more advanced ones like ChatGPT-3.5 and ChatGPT-4, had already encountered examples of cognitive reflection tests during their training. As a result, these models might have been able to solve the tasks ‘from memory’ rather than through genuine reasoning or problem-solving processes.

“The progress in [large language models (LLMs) such as ChatGPT] not only increased their capabilities, but also reduced our ability to anticipate their properties and behavior,” the researchers concluded. “It is increasingly difficult to study LLMs through the lenses of their architecture and hyperparameters. Instead, as we show in this work, LLMs can be studied using methods designed to investigate another capable and opaque structure, namely the human mind. Our approach falls within a quickly growing category of studies employing classic psychological tests and experiments to probe LLM ‘psychological’ processes, such as judgment, decision-making and cognitive biases.”

The study, “Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT,” was authored by Thilo Hagendorff, Sarah Fabi, and Michal Kosinski.

(c) PsyPost Media Inc