A recent study published in the Proceedings of the National Academy of Sciences suggests that giving high school students unrestricted access to artificial intelligence for math practice can impair their ability to learn the material. While AI chatbots help students correctly solve problems when the technology is available, removing the AI causes these students to perform worse on independent exams than peers who never used the tools.
Generative artificial intelligence refers to computer programs that can instantly create original content, such as conversational text or mathematical solutions, by analyzing patterns in massive amounts of existing data. While this technology has become highly popular, its impact on student development remains a subject of debate.
“Generative AI arrived with enormous excitement about its productivity potential. But if AI undermines skill development, it could harm both productivity and cognitive abilities in the long run. At the time, there was remarkably little causal evidence on AI’s effects in education,” said study author Alp Sungu, an assistant professor at the Wharton School at the University of Pennsylvania.
To investigate whether these systems aid or impede human learning, the scientists explored a concept educators and psychologists call cognitive debt or cognitive atrophy. This theory suggests that outsourcing reasoning to machines might erode the brain’s ability to think critically. When a computer completes an assignment, the user misses out on the productive struggle required to solve complex problems independently.
This loss of independent problem solving is especially risky because AI models frequently generate false information. People must maintain the expertise required to evaluate the machine’s output. If students rely on algorithms early in their education without practicing this kind of evaluation themselves, they may fail to develop the foundational skills needed for future success.
To test how different software configurations affect these outcomes, the scientists conducted a randomized controlled trial at a large high school in Turkey during the Fall 2023 semester. The sample included nearly 1,000 students across about fifty regular classrooms in the ninth, tenth, and eleventh grades. The researchers focused on mathematics instruction, dedicating four ninety-minute sessions to topics that comprised about fifteen percent of the semester’s curriculum.
Students were divided into three groups. The first group served as a control, meaning they completed practice problems using only their standard textbooks and class notes. The second group received laptops equipped with a program called GPT Base, which presented a standard ChatGPT-style interface and was instructed simply to act as a math tutor.
The third group received laptops with a program called GPT Tutor. This application used the same interface but included invisible background instructions designed by teachers to safeguard the learning process. These instructions forced the AI to provide step-by-step hints instead of direct answers, and fed it the correct solutions so it would not invent false information.
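In practice, a guardrail like this usually amounts to a hidden system prompt layered on top of the same underlying model. The following is a minimal sketch of that pattern using the OpenAI Python SDK; the prompt wording, the tutor_reply helper, and the model name are illustrative assumptions, not the study’s actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# Hypothetical guardrail prompt; the teacher-written instructions used in
# the study are not reproduced verbatim in this article.
TUTOR_SYSTEM_PROMPT = """You are a math tutor for high school students.
Never reveal a final answer. Offer one step-by-step hint at a time and
ask the student to attempt the next step on their own.
The verified solution below is provided only so your hints stay
accurate; do not quote it directly.

Reference solution:
{solution}
"""

def tutor_reply(problem: str, solution: str, student_message: str) -> str:
    """Return a hint-only reply grounded in a teacher-verified solution."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the study's exact model is not named here
        messages=[
            {"role": "system", "content": TUTOR_SYSTEM_PROMPT.format(solution=solution)},
            {"role": "user", "content": f"Problem: {problem}\n\n{student_message}"},
        ],
    )
    return response.choices[0].message.content
```

Seeding the hidden prompt with a verified solution targets both failure modes the article describes: it constrains the model to hints rather than answers, and it anchors those hints to a correct solution, reducing the kind of arithmetic errors the researchers later found in the unrestricted condition.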
Each class session was split into three distinct parts. First, the teacher provided a traditional lecture and solved examples on the board. Next, the students engaged in an assisted practice period where they worked through specific problems designed to reinforce the lecture concepts.
During this second phase, students used the resources assigned to their specific group, whether that was textbooks or one of the two AI programs. Finally, students completed an unassisted exam containing problems conceptually identical to the practice exercises. During this last phase, all students worked entirely independently with no access to laptops, books, or notes.
To ensure students took the task seriously, performance on both the practice problems and the final exam counted toward their actual class grades. Independent graders scored the paper submissions using a standardized rubric to eliminate any potential teacher bias.
The researchers observed a massive boost in performance during the practice phase for students using artificial intelligence. Compared to the control group, students using the unrestricted GPT Base scored forty-eight percent higher, while students using the guided GPT Tutor scored one hundred twenty-seven percent higher.
When the researchers evaluated the independent exam results, a very different pattern emerged. Students in the unrestricted GPT Base group performed seventeen percent worse on the unassisted exam than students in the control group, who never had access to a computer. Students in the guided GPT Tutor group performed statistically no differently from the control group, indicating that the software constraints prevented the learning loss seen in the unrestricted group.
“Honestly, the main result did surprise me! I wasn’t expecting AI to harm student learning (you generally don’t run a study expecting the treatment to hurt outcomes!),” Sungu told PsyPost.
“The effects are non-trivial and move in opposite directions depending on what you measure. With AI access, students scored 48% higher on practice problems, but once that access was removed, the same students scored 17% lower on exams than those who never had AI at all. These aren’t subtle differences, they suggest that how students interact with AI during learning fundamentally shapes what they retain.”
By analyzing the chat logs between the students and the artificial intelligence, the scientists identified why the unrestricted AI harmed learning. The vast majority of students using GPT Base simply asked the program for the correct answer. The AI obliged, but the researchers found it provided the correct answer only fifty-one percent of the time, frequently making logical and arithmetic errors.
Students would blindly copy these flawed answers onto their practice sheets without fully understanding the material. Because they used the tool as a crutch, they were entirely unprepared when they had to solve similar problems on their own. In contrast, students using GPT Tutor were forced to interact with the material, asking for help and attempting to solve the problems themselves, which preserved their ability to learn.
“Within traditional educational settings, students tend to use standard generative AI tools as an answer machine, a way to get homework done rather than to learn,” Sungu said. “That shortcut can hurt their actual learning.”
The researchers also discovered a significant mismatch between how much students thought they learned and their actual performance. Students using the unrestricted GPT Base program felt they had learned just as much as their peers, despite performing substantially worse on the independent exam. Meanwhile, students using the guided GPT Tutor perceived that they had performed significantly better, even though their final exam scores merely matched the control group.
A common misinterpretation of these findings is the assumption that well-designed artificial intelligence cannot improve educational outcomes at all. The guided AI program in this study did not cause students to outperform the control group on the final exam, but it relied on relatively simple background programming. The researchers suggest that more advanced systems acting as proactive tutors could eventually yield positive learning benefits.
“We tested one tool with fairly simple prompt engineering,” Sungu noted. “Much more work is needed on long-term learning, skill development, and downstream productivity effects.”
A major limitation of the study is its specific context: it evaluated mathematics instruction at a single high school during a period when AI chatbots were still a novel technology. The scientists point out that different subjects, such as writing, lack objective evaluation criteria and might interact differently with artificial intelligence. Additionally, the researchers measured only short-term outcomes rather than evaluating skills over an extended period.
Future research should examine the long-term effects of AI use on skill development to see whether extended reliance leads to broader cognitive atrophy. Scientists hope to evaluate new educational policies that guide these digital tools toward societal benefit rather than intellectual harm.
“AI isn’t just changing how we work. It’s changing how we think,” Sungu explained. “That makes education the front line. My goal is to study how AI and other digital technologies reshape human capital development more broadly, and to design and test policies that steer these tools toward societal value rather than a potential cognitive atrophy.”
The study, “Generative AI without guardrails can harm learning: Evidence from high school mathematics,” was authored by Hamsa Bastani, Osbert Bastani, Alp Sungu, Haosen Ge, Özge Kabakcı, and Rei Mariman.