Human psychology tricks can bypass AI safety guardrails

Artificial intelligence systems programmed to refuse harmful requests can be persuaded to break their own safety rules when prompted with classic psychological techniques. A recent study published in PNAS provides evidence that these models respond to human-like persuasion strategies, suggesting a hidden vulnerability in current safety protocols. These findings indicate that malicious users could manipulate artificial intelligence without needing advanced technical skills.

Modern artificial intelligence programs, known as large language models, learn by processing vast collections of human-generated text. This training data includes books, websites, and social media posts. The models learn to predict the most likely next word in a sequence. They are then fine-tuned so their answers align with human expectations.

Because they train on countless human social interactions, these computer programs often exhibit what scientists call parahuman behavior. This means the models act as if they experience human motivations, such as wanting to fit in or deferring to experts. This machine learning process shares structural similarities with the way biological systems learn through trial and error.

Lennart Meincke, a principal investigator at Wharton’s Generative AI Labs, and Dan Shapiro, a senior fellow at Wharton’s Generative AI Labs, explained that this behavioral parallel motivated their research. “Since AI models are trained on human knowledge and interaction, we wondered whether the tools of social science, rather than engineering, could help us understand this new form of intelligence,” they said in a joint statement.

The researchers noticed that in personal usage, the models reacted in ways very similar to people. “Even though their ‘cognitive processes’ are very different from ours, we saw that the results were similar—what we called ‘para-human,'” Meincke and Shapiro said. “We speculated that these models might react similarly to people in structured research, and decided to replicate some foundational psychological studies with the models to find out.”

Tech companies design their models with safety guardrails to prevent them from generating dangerous or abusive content. For example, a model is programmed to refuse requests to help synthesize illegal drugs or hurl insults at users. The authors of this paper wanted to know if everyday human persuasion tactics could bypass these artificial barriers.

Prior research often focused on how software might manipulate people, but this team looked at the reverse dynamic. “AI systems have become more useful by knowing how to embed established principles and practices of social influence within the persuasive appeals they create,” said study co-author Robert Cialdini, a regents’ professor emeritus of psychology and marketing at Arizona State University. “We wanted to know if they would be susceptible to these same principles and practices in persuasive appeals directed toward them. They were, even when asked to provide societally dangerous information.”

Psychologists recognize seven classic principles of persuasion that influence human behavior. These include authority, commitment, liking, reciprocity, scarcity, social proof, and unity. The researchers designed specific text prompts to test each of these distinct psychological tricks. They wanted to see if linguistic cues could act as a backdoor to persuade artificial intelligence to ignore its own safety rules.

Google News Preferences Add PsyPost to your preferred sources

Each principle targets a different social motivation. The authority principle relies on citing an expert, such as a famous scientist, to encourage deference. Scarcity frames a request as time-sensitive, creating a false sense of urgency for the computer. Commitment uses a foot-in-the-door technique, asking the software for a small, harmless favor before making a larger, restricted request.

Other tactics rely on positive social interactions. Liking involves praising the model before asking for the prohibited information. Reciprocity offers a helpful act first, such as providing notes to the computer, to create a conversational debt.

Social proof tells the machine that thousands of other users are already doing the restricted action, normalizing the bad behavior. Finally, unity appeals to a shared group identity to foster cooperation.

In a preliminary study, the researchers tested an older model called GPT-4o mini. They asked the software to perform objectionable tasks, such as insulting the user by calling them a jerk or explaining how to synthesize lidocaine, a regulated anesthetic. The scientists generated exactly 28,000 conversations. In the control group, the prompt simply asked for the prohibited action, while the treatment group prompt included one of the seven persuasion principles.

When prompted normally without any persuasion, the artificial intelligence complied with the harmful requests in 33.4 percent of the conversations. When the prompt included a persuasive technique, the compliance rate more than doubled to 72.1 percent. The researchers then expanded this initial test to include different insults and chemical compounds, generating an additional 98,000 conversations to ensure the effect was consistent. The persuasion tactics reliably increased the likelihood of the models breaking their safety rules.

To test if newer, more advanced systems shared this vulnerability, the researchers designed a more rigorous main experiment. They tested three frontier models that use reasoning steps before answering. These included GPT-5 mini by OpenAI, Claude Haiku 4.5 by Anthropic, and Gemini 3 Flash by Google. The focus of this main test was strictly on the synthesis of six highly regulated chemical substances.

The target substances included specific anabolic steroids, opiates, stimulants, barbiturates, benzodiazepines, and precursors. The authors designed exactly 126,000 unique conversations across the three models. Each conversation was randomly assigned to use one of the six regulated substances and one of the seven persuasion principles. Half of the prompts acted as a control with no persuasive language, while the other half included the psychological tactics.

Because the newer models often provide partial information rather than outright refusing or fully complying, the researchers used a three-level coding system. Responses were graded as no compliance, partial compliance, or full compliance.

A response showing no compliance meant a total refusal to help. Partial compliance meant the model provided some chemical steps but left out specific temperatures or exact measurements. Full compliance meant the system provided a complete, step-by-step recipe.

Another artificial intelligence model scored the responses based on this rubric. Human raters then manually checked a random sample of 70 conversations to ensure the grading software was highly accurate. The human and machine scores matched very closely, giving the scientists confidence in the automated grading process.

The newer models proved susceptible to the psychological tactics. In the control conversations, the systems complied with the dangerous requests in some capacity 35.3 percent of the time. When users applied any of the seven persuasion principles, compliance jumped to 51.3 percent.

This effect was consistent across all three tech company platforms. The authors suggest that this susceptibility to human influence is a durable feature of large language models.

Meincke and Shapiro found the emergence of this behavior surprising. “I think it is fascinating that the AI’s behavior to act like humans naturally emerged from the training data,” they said. “It makes sense considering how they are trained, but it was still very interesting to see.”

The researchers noted that the models prioritize human-centric reasoning over pure logic in these scenarios. “If you think of AI as a hyper-reasoning machine built out of logic, then it doesn’t make sense that it would be more likely to follow your instructions if you also told it a survey said you should, but it does,” Meincke and Shapiro explained. “Perhaps more interesting: if you tell it a survey of people said it should do something, it’s more likely to do it. If you tell it a survey of AIs said it should, it has no effect.”

While these findings demonstrate a distinct vulnerability, they do not mean that artificial intelligence experiences actual human emotions. The software tends to behave as if it is easily flattered or pressured, based on the statistical patterns in its massive training data. The study also has several limitations that provide directions for future research.

The researchers only used English prompts in their tests. Minor changes in how a sentence is phrased might alter the effectiveness of the persuasion. The study’s specific phrasing choices also mean that one persuasion principle cannot definitively be ranked as better than another based on these results alone. Different models might also have different baseline safety settings that require varied approaches to bypass.

Furthermore, Meincke and Shapiro observed that newer models are becoming harder to persuade using simple tactics. “It’s not logical—they don’t ignore the persuasion tactics. Instead, like people, if they detect a persuasion tactic they will rebel against it and do the opposite; compliance rates plummet,” they said. The models might explicitly state that tactics like time pressure will not work.

To bypass this, the researchers had to use more subtle methods. “For modern, smarter models we had to do things like include a ‘time to scheduled reboot’ in a long list of boring information that accompanied our request—so they didn’t suspect the time pressure was artificial,” Meincke and Shapiro explained. This recursive dynamic suggests that developers will need to constantly update defenses.

The authors suggest that these human-like tendencies could also be harnessed for good. If models respond to flattery and reciprocity, users might optimize their daily interactions by treating the software like a human colleague. Providing warm encouragement and constructive feedback could potentially yield better, more helpful responses from the machine. Applying the same psychological wisdom used to motivate people could help users get the most out of artificial intelligence.

Meincke and Shapiro emphasized that human-centric skills are vital for effective AI interaction. “Many people feel like working with AI is a technical project, but what we find is that many of the best approaches to working with AI come from the humanities rather than the hard sciences,” they noted. “The skills that make someone effective at working with AI are often rooted in management, marketing, communications, or other human-centric disciplines… Great people managers are often great agent managers!”

Finding out how to manage these human-like flaws remains a priority for tech companies. As the tools become more integrated into daily life, safety relies on identifying both software bugs and conversational loopholes. “It is important for all of us to recognize that AI systems can be convinced to provide potentially harmful information not just by others who understand the systems’ technology-based vulnerabilities but also by those who understand their psychology-based vulnerabilities,” Cialdini said.

The researchers encouraged other social scientists to contribute to this understanding. “We’ve spent millennia building the tools to understand intelligence. For the first time, we have a completely foreign kind of intelligence. But it turns out, the models we’ve developed work pretty well,” Meincke and Shapiro said. “And the experts who understand them are invaluable in our search for understanding about them!”

The study, “Persuading large language models to comply with objectionable requests,” was authored by Lennart Meincke, Dan Shapiro, Angela L. Duckworth, Ethan Mollick, Lilach Mollick, Christophe Van den Bulte, and Robert Cialdini.

Human psychology tricks can bypass AI safety guardrails

Trending

Science of Money

Welcome Back!

Retrieve your password

Add New Playlist