Users of generative AI struggle to accurately assess their own competence

by Eric W. Dolan
December 29, 2025
in Artificial Intelligence, Cognitive Science

New research provides evidence that using artificial intelligence to complete tasks can improve a person’s performance while simultaneously distorting their ability to assess that performance accurately. The findings indicate that while users of AI tools like ChatGPT achieve higher scores on logical reasoning tests than people working alone, they consistently overestimate their success by a significant margin.

This pattern suggests that AI assistance may disconnect a user’s perceived competence from their actual results, leading to a state of inflated confidence. The study was published in the scientific journal Computers in Human Behavior.

Scientists and psychologists have increasingly focused on how human cognition changes when augmented by technology. As generative AI systems become common in professional and educational settings, it is essential to understand how these tools influence metacognition. Metacognition refers to the ability of an individual to monitor and regulate their own thinking processes. It allows people to know when they are likely correct and when they might be making an error.

Previous psychological inquiries have established that humans generally struggle with self-assessment. A well-known phenomenon called the Dunning-Kruger effect describes how individuals with lower skills tend to overestimate their competence, while highly skilled individuals often underestimate their abilities. The authors of the current paper sought to determine if this pattern persists when humans collaborate with AI. They aimed to understand if AI acts as an equalizer that fixes these biases or if it introduces new complications to how people evaluate their work.

To investigate these questions, the research team designed two distinct studies centered on logical reasoning tasks. In the first study, they recruited 246 participants from the United States. These individuals were asked to complete 20 logical reasoning problems taken from the Law School Admission Test (LSAT). The researchers provided participants with a specialized web interface. This interface displayed the questions on one side and a ChatGPT interaction window on the other.

Participants were required to interact with the AI at least once for each question. They could ask the AI to solve the problem or explain the logic. After submitting their answers, participants estimated how many of the 20 questions they believed they had answered correctly. They also rated their confidence on a specific scale for each individual decision.

The results of this first study showed a clear improvement in objective performance. On average, participants using ChatGPT scored approximately three points higher than a historical control group of people who took the same test without AI assistance. The AI helped users solve problems that they likely would have missed on their own.

Despite this improvement in scores, the participants engaged in significant overestimation. On average, the group estimated they had answered about 17 out of 20 questions correctly. In reality, their average score was closer to 13. This represents a four-point gap between perception and reality. The data suggests that the seamless assistance provided by the AI created an illusion of competence.
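
To make the gap concrete, here is a minimal sketch of the calibration calculation, assuming hypothetical per-participant numbers (the study reports only the group averages of roughly 17 estimated versus 13 correct out of 20):

```python
# Calibration gap: self-estimated score minus actual score.
# The values below are hypothetical stand-ins for per-participant data;
# the article reports only group averages (about 17 estimated vs. 13 correct).
estimated = [17, 18, 16, 17, 18]  # self-estimates out of 20
actual    = [13, 14, 12, 13, 13]  # scored results out of 20

gaps = [e - a for e, a in zip(estimated, actual)]
mean_gap = sum(gaps) / len(gaps)
print(f"Mean overestimation: {mean_gap:.1f} points")  # ~4 points, matching the reported gap
```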

The study also analyzed the relationship between a participant’s knowledge of AI and their self-assessment. The researchers measured “AI literacy” using a tool called the Scale for the Assessment of Non-Experts’ AI Literacy. One might expect that understanding how AI works would make a user more skeptical or accurate in their judgment. The findings indicated the opposite. Participants with higher technical understanding of AI tended to be more confident in their answers but less accurate in judging their actual performance.

A significant theoretical contribution of this research involves the Dunning-Kruger effect. In typical scenarios without AI, plotting self-assessments against actual scores produces a steep slope: low performers vastly overestimate themselves while high performers do not. When participants used AI, this effect vanished. The “leveling” effect of the technology meant that overestimation became uniform across the board. Low performers and high performers alike inflated their scores by similar amounts.
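
One standard way to quantify this leveling is to regress each participant’s estimation error on their actual score; the sketch below illustrates the general technique on made-up data, and is not necessarily the authors’ exact analysis. A steep negative slope reproduces the classic Dunning-Kruger pattern, while a slope near zero corresponds to the uniform bias described here:

```python
import numpy as np

# Hypothetical AI-group data: scores out of 20, with a roughly uniform
# four-point overestimation that does not depend on skill.
rng = np.random.default_rng(0)
actual_score = rng.integers(8, 20, size=200).astype(float)
estimation_error = 4 + rng.normal(0, 1, size=200)  # bias unrelated to skill

# Fit error = slope * actual + intercept; np.polyfit returns the highest degree first.
slope, intercept = np.polyfit(actual_score, estimation_error, deg=1)
print(f"slope = {slope:+.2f}")  # near 0 => overestimation does not vary with skill
```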

The researchers observed that the combined performance of the human and the AI did not exceed the performance of the AI alone. The AI system, when running the test by itself, achieved a higher average score than the humans using the AI. This suggests a failure of synergy. Humans occasionally accepted incorrect advice from the AI or overrode correct advice, dragging the overall performance down below the machine’s maximum potential.

To ensure these findings were robust, the researchers conducted a second study. This replication involved 452 participants. The researchers split this sample into two distinct groups. One group performed the task with AI assistance, while the other group worked without any technological aid.

In this second experiment, the researchers introduced a monetary incentive to encourage accuracy. Participants were told they would receive a financial bonus if their estimate of their score matched their actual score. The goal was to rule out the possibility that participants were simply not trying hard enough to be self-aware.

The results of the second study mirrored the first. The monetary incentive did not correct the overestimation bias. The group using AI continued to perform better than the unaided group but persisted in overestimating their scores. The unaided group showed the classic Dunning-Kruger pattern, where the least skilled participants showed the most bias. The AI group again showed a uniform bias, confirming that the technology fundamentally shifts how users perceive their competence.

The study also utilized a measurement called the “Area Under the Curve,” or AUC, to judge metacognitive sensitivity. This metric captures whether a person is more confident when they are right than when they are wrong. Ideally, a person should feel unsure when they make a mistake. The data showed that participants had low metacognitive sensitivity: their confidence levels were high regardless of whether a given answer was right or wrong.
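
In this context, the AUC can be read as the probability that a randomly chosen correct answer received a higher confidence rating than a randomly chosen incorrect one, with 0.5 meaning confidence carries no information about accuracy. A minimal sketch of that computation, using made-up ratings rather than the study’s data:

```python
def metacognitive_auc(confidence, correct):
    """Probability that a correct trial got a higher confidence rating than
    an incorrect one (ties count as half). 0.5 = no sensitivity, 1.0 = perfect."""
    pos = [c for c, ok in zip(confidence, correct) if ok]       # correct trials
    neg = [c for c, ok in zip(confidence, correct) if not ok]   # incorrect trials
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-question data: confidence is high whether the answer
# was right or wrong, which is the low-sensitivity pattern described above.
conf = [95, 90, 92, 94, 91, 93]
acc  = [1,   0,  1,  0,  1,  0]
print(f"AUC = {metacognitive_auc(conf, acc):.2f}")  # near 0.5
```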

Qualitative data collected from chat logs offered additional context. The researchers noted that most participants acted as passive recipients of information. They frequently copied and pasted questions into the chat and accepted the AI’s output without significant challenge or verification. Only a small fraction of users treated the AI as a collaborative partner or a tool for double-checking their own logic.

The researchers discussed several potential reasons for these outcomes. One possibility is the “illusion of explanatory depth.” When an AI provides a fluent, articulate, and instant explanation, it can trick the brain into thinking the information has been processed and understood more deeply than it actually has. The ease of obtaining the answer reduces the cognitive struggle usually required to solve logic puzzles, which in turn dulls the internal signals that warn a person they might be wrong.

As with all research, there are caveats to consider. The first study used a historical comparison group rather than a simultaneous control group, though the second study corrected this. Additionally, the task was limited to LSAT logical reasoning questions. It is possible that different types of tasks, such as creative writing or coding, might yield different metacognitive patterns.

The study also relied on a specific version of ChatGPT. As these models evolve and become more accurate, the dynamic between human and machine could shift. The researchers also noted that the participants were required to use the AI, which might differ from a real-world scenario where a user chooses when to consult the tool.

Future research directions were suggested to address these gaps. The researchers recommend investigating design changes that could force users to engage more critically. For example, an interface might require a user to explain the AI’s logic back to the system before accepting an answer. Long-term studies are also needed to see if this overconfidence fades as users become more experienced with the limitations of large language models.

The study, “AI makes you smarter but none the wiser: The disconnect between performance and metacognition,” was authored by Daniela Fernandes, Steeven Villa, Salla Nicholls, Otso Haavisto, Daniel Buschek, Albrecht Schmidt, Thomas Kosch, Chenxinran Shen, and Robin Welsch.
