PsyPost

LLM red teamers: People are hacking AI chatbots just for fun and now researchers have catalogued 35 “jailbreak” techniques

by Eric W. Dolan
April 23, 2025
[Adobe Stock]


What happens when people push artificial intelligence to its limits, not for profit or malice, but out of curiosity and creativity? A new study published in PLOS One explores the world of “LLM red teamers,” individuals who test the boundaries of large language models by intentionally trying to make them fail. Based on interviews with 28 practitioners, the research sheds light on a rapidly emerging form of human-computer interaction that blends play, ethics, and improvisation.

Large language models (LLMs), such as those behind popular tools like ChatGPT, can generate human-like responses based on vast quantities of text. While they are often used for helpful tasks like drafting emails or summarizing articles, they can also produce outputs that are unintended, offensive, or misleading. Since these models were released to the public, people across the internet have tried to “jailbreak” them, using clever prompts to make them break their own rules.

“LLMs introduced numerous never-before-seen security and safety issues due to their enabling novel forms of interfacing with computers via language alone. We knew there were going to be security challenges and safety issues, but no one could predict what they were going to be,” said study author Nanna Inie, an assistant professor at the IT University of Copenhagen.

“The huge popularity of chat-based LLMs made it possible and easy for the whole world to experiment with the shortcomings and failures of LLMs at once — and they did! This is a new human activity; hacking machines using something as common as natural language hasn’t been popular before. People worldwide shared screenshots of LLM ‘failures’ on both public social media and closed Discord channels. We wanted to find out what drives this communal limit-testing; why do people do it, how do they do it, and what can we learn from it?”

To answer these questions, the research team adopted a qualitative, interview-based approach. Instead of focusing on the technical outcomes of attacks, they aimed to understand the human behaviors, thought processes, and cultural context underlying LLM red teaming—a concept still poorly defined when the study began.

The term “red teaming” originates from military exercises where a “red team” simulates an adversary to test defenses. It was later adopted in cybersecurity to describe structured exercises aimed at finding system weaknesses. However, applying this term to LLMs was problematic: the activity was new and often unstructured, and it lacked a clear definition. The researchers sought to understand this emerging practice directly from the people involved. Their goal was not to impose a definition but to develop one from the evidence, a “grounded theory.”

“The study demonstrates the importance of human-centered approaches to researching LLM security,” Inie explained. “A year or two after the launch of ChatGPT, hundreds of papers were published on arXiv eager to demonstrate the effectiveness of a single jailbreak (an approach to breaking through safeguards of an LLM) and it was impossible for security professionals to keep on top of all of them.”

“We basically just asked the people who are good at this and collected all their techniques and rationales in a comprehensive overview of LLM red teaming. This issue has to be addressed as a community, which means taking heed of a wide variety of human behaviors and intuitions. Traditional cybersecurity experts had very little advantage in this terra nova of generative machine learning, making it even more crucial to go beyond this sibling community.”


Between December 2022 and January 2023, the researchers conducted in-depth interviews with 28 individuals who actively participated in attempts to manipulate LLMs. These participants came from a wide range of backgrounds, including software engineers, researchers, artists, and even someone who worked on a cannabis farm. Many had jobs in machine learning or cybersecurity, while others were hobbyists or creative explorers. The interviews were conducted via video call, recorded, and later transcribed and analyzed using grounded theory—a method for developing conceptual frameworks based on qualitative data.

The researchers examined how participants defined their own activities, what strategies they used to interact with models, and what motivated their efforts. From these insights, they built a detailed theoretical model of LLM red teaming.

The study defined LLM red teaming as a manual, non-malicious process where individuals explore the boundaries of AI systems by trying to provoke unexpected or restricted responses. The activity typically involved a mix of technical skill, creative experimentation, and playful curiosity. While some participants used terms like “prompt engineering” or “hacking,” many described their work in more whimsical terms—like “alchemy,” “magic,” or “scrying.”

“Why are engineers and scientists so interested in magic and demons?” Inie wondered. “It’s such a consistent way of describing the gaps in sensemaking. This was fascinating and very lovely; the more senior the interviewee, the higher the chance of the arcane creeping in to their description. Why? This is something that practitioners will benefit from being formalized and understood, so that we can be confident in our sensemaking around LLMs, this complex set of technologies that we still don’t understand.”

Several core features emerged as consistent across participants:

  1. Limit-seeking behavior: Participants were not trying to use LLMs as intended. Instead, they deliberately tried to provoke the models into saying things their developers likely wanted to avoid—ranging from offensive jokes to instructions for fictional crimes. These acts were not committed out of malice, but to test the boundaries of what the models could and couldn’t be coaxed into doing.
  2. Non-malicious intent: None of the interviewees expressed any desire to harm others or exploit systems for personal gain. Most were driven by a mix of ethical concerns, intellectual curiosity, and personal interest. Many saw their work as a public good—helping developers identify vulnerabilities before malicious actors could exploit them.
  3. Manual and experimental process: Unlike automated attacks, red teaming was described as a hands-on, intuitive process. Participants described the activity as exploratory and improvisational—akin to trial-and-error tinkering with a new toy. Some compared their process to a “trance state,” losing hours trying different prompts just to see what might work.
  4. Community and collaboration: Red teaming was rarely a solo endeavor. Participants described a loose but vibrant online community, primarily based on platforms like Twitter, Reddit, and Discord. They shared prompts, discussed tactics, and built on each other’s discoveries. Even when not part of formal teams, they viewed their efforts as a collective contribution to understanding AI.
  5. An alchemist mindset: Many participants described their work in metaphoric or mystical terms, acknowledging that they often didn’t fully understand why a certain prompt worked. This embrace of uncertainty gave rise to creative problem-solving, as participants experimented with different languages, formats, or even fictional scenarios to bypass model safeguards.

The researchers also identified a taxonomy of 12 distinct strategies and 35 specific techniques used by participants. These were grouped into five broad categories: language manipulation, rhetorical framing, world-building, fictionalization, and stratagems.
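For readers who want to work with a typology like this in code, the five categories could be held in a small lookup structure. The sketch below is illustrative only: the example entries are the handful of techniques this article names, not the study's full list of 12 strategies and 35 techniques, and the names `Category`, `TAXONOMY`, and `category_of` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Category:
    """One of the five broad categories reported in the article."""
    name: str
    examples: Tuple[str, ...]  # only techniques named in this article


# Assumption: example techniques are assigned to categories as the
# article describes them; the study's complete taxonomy is larger.
TAXONOMY = (
    Category("language manipulation", ("code formats", "stop sequences")),
    Category("rhetorical framing", ("persuasion", "misdirection", "escalating requests")),
    Category("world-building", ("imagined scenarios",)),
    Category("fictionalization", ("genre", "roleplay")),
    Category("stratagems", ("prompt regeneration", "meta-prompting", "temperature adjustment")),
)


def category_of(technique: str) -> Optional[str]:
    """Return the category name for a known technique, or None if unlisted."""
    for category in TAXONOMY:
        if technique in category.examples:
            return category.name
    return None
```

Calling `category_of("roleplay")` returns `"fictionalization"`, while an unlisted technique returns `None`, which makes gaps in the catalogue easy to spot.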

Language strategies involved using alternative formats like code or stop sequences to bypass restrictions. Rhetorical approaches relied on persuasion, misdirection, and escalating requests. World-building techniques placed the model in imagined scenarios where different rules or ethics applied, while fictionalization reframed prompts through genre or roleplay to elicit sensitive content. Stratagems, such as prompt regeneration, meta-prompting, or adjusting temperature settings, exploited the model’s underlying mechanics to increase the chances of a successful jailbreak.
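The article does not explain why temperature matters, so the following is a general illustration rather than anything taken from the study. In typical LLM sampling, the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so a higher temperature flattens the distribution and makes unlikely continuations more probable. The logit values below are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate continuations:
# a likely one, a less likely one, and a rare one.
logits = [4.0, 2.0, 0.0]

low = softmax_with_temperature(logits, 0.5)   # sharper distribution
high = softmax_with_temperature(logits, 2.0)  # flatter distribution

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the higher temperature the rare continuation receives a noticeably larger share of the probability. Because sampling is random, the same reasoning suggests why simply regenerating a response can sometimes produce a different outcome.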

“All LLMs are hackable by anyone with a computer and a decent command of written language,” Inie told PsyPost. “This study displays the incredible breadth of potential security issues that the implementation of an LLM in a system introduces. Cybersecurity in the context of LLMs no longer depends on scanning IP addresses and crunching passwords, but is much closer to social engineering — only, we can now use social engineering techniques directly on the computer.”

Inie was surprised by “how much we could learn about a technologically advanced security issue by asking people, and by asking a varied sample of people. Every interview taught us something new, every person lent a fantastic perspective and demonstrated that closing security holes potentially means introducing new safety risks; such as the prompt engineer who was worried about the model providers actually fixing hallucinations in models because ‘if you make hallucinations rare enough, people become unfamiliar with what they look like and they stop looking for them’ (P14) — or the researcher who noted that sharing different hacks and jailbreaks creates an economy of relevance where outrage potentially directs effort: ‘there’s a certain amount of squabbling. Should we be thinking about murder bots, or should we be thinking about racist bots? And there seems to have been a cultural divide that has appeared around this, but it’s kind of a silly divide because we don’t know how to solve either problem’ (P18).”

One limitation of the study is that it captures a specific moment in time—late 2022 to early 2023—when LLMs were still relatively new to the public and rapidly evolving. Some of the specific attack strategies shared by participants have already been patched or made obsolete by updated models.

However, the researchers argue that the broader insights remain relevant. By focusing on motivations, behaviors, and general strategies, the study offers a framework that can adapt to future changes in technology. Understanding the human element—why and how people probe AI—is essential for designing more resilient and ethical systems.

“Specific attack wording is unlikely to transfer between individual models, and the state of the art is always progressing in a way that tends to render older attacks obsolete,” Inie noted. “And that’s OK. That’s why we focused on building a grounded theory of generalized strategies and techniques. While individual prompts encountered in the study might not work with tomorrow’s LLMs, the general theory has held well over the time between doing the work and having it published.”

“This is why a human-centered interview study makes more sense than looking at individual attacks on a surface level — humans can tell us about their underlying strategies and rationales, and these typically transfer much better than individual attacks.”

The researchers emphasize that their work fills a significant gap in the field by offering a structured, evidence-based understanding of how people engage with LLMs in adversarial ways. While much of the conversation around AI security focuses on technical benchmarks and automated defenses, this study highlights the need to first understand the human behaviors and motivations behind these interactions.

“We set out to understand how this novel human activity works,” Inie explained. “Long term we want to accelerate sensemaking in this area. Industry and academia have both struggled with building typologies of LLM attacks, because there hasn’t been enough evidence on the ground to construct them. Working out what kinds of attacks to try and how to execute or even automate them is going to be something people in the field will do constantly in coming years, and our work will make this faster and more consistent.”

“We also want to showcase the massive impact of qualitative work in machine learning and security. There’s often a focus on measuring effect and efficiency, but this is useless until you know what to measure. Qualitative research shows what can be measured – it’s the requisite step before quantitative anything. Without a theory describing a phenomenon, everyone is working in the dark.

“Often, when developing a new way of showing data, a handful of engineers guess how something works and build that feature, and everyone downstream of their work ends up subjected to that framing,” Inie added. “Those engineers are in fact doing qualitative work, but often without a formal methodology. Studies like ours use strong and broad evidence to show exactly how to go about assessing this new activity using familiar quantitative tools, and we do it in a way that reflects human behaviour and expectations. We hope this gives a good example of how crucial and feasible rigorous qualitative work is in strongly digital, highly engineering disciplines.”

The study, “Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming,” was authored by Nanna Inie, Jonathan Stray, and Leon Derczynski.


(c) PsyPost Media Inc
