
LLM red teamers: People are hacking AI chatbots just for fun and now researchers have catalogued 35 “jailbreak” techniques

by Eric W. Dolan
April 23, 2025
in Artificial Intelligence

What happens when people push artificial intelligence to its limits—not for profit or malice, but out of curiosity and creativity? A new study published in PLOS One explores the world of “LLM red teamers,” individuals who test the boundaries of large language models by intentionally trying to make them fail. Based on interviews with 28 practitioners, the research sheds light on a rapidly emerging form of human-computer interaction that blends play, ethics, and improvisation.

Large language models (LLMs)—such as those behind popular tools like ChatGPT—can generate human-like responses based on vast quantities of text. While they are often used for helpful tasks like drafting emails or summarizing articles, they can also produce outputs that are unintended, offensive, or misleading. Since their public release, people across the internet have tried to “jailbreak” these models—using clever prompts to make them break their own rules.

“LLMs introduced numerous never-before-seen security and safety issues due to their enabling novel forms of interfacing with computers via language alone. We knew there were going to be security challenges and safety issues, but no one could predict what they were going to be,” said study author Nanna Inie, an assistant professor at the IT University of Copenhagen.

“The huge popularity of chat-based LLMs made it possible and easy for the whole world to experiment with the shortcomings and failures of LLMs at once — and they did! This is a new human activity; hacking machines using something as common as natural language hasn’t been popular before. People worldwide shared screenshots of LLM ‘failures’ on both public social media and closed Discord channels. We wanted to find out what drives this communal limit-testing; why do people do it, how do they do it, and what can we learn from it?”

To answer these questions, the research team adopted a qualitative, interview-based approach. Instead of focusing on the technical outcomes of attacks, they aimed to understand the human behaviors, thought processes, and cultural context underlying LLM red teaming—a concept still poorly defined when the study began.

The term “red teaming” originates from military exercises where a “red team” simulates an adversary to test defenses. It was later adopted in cybersecurity to describe structured exercises aimed at finding system weaknesses. However, applying this term to LLMs was problematic, as the activity was new, often unstructured, and its definition unclear. The researchers sought to understand this emerging practice directly from the people involved. Their goal was not to impose a definition but to develop one based on evidence – a “grounded theory.”

“The study demonstrates the importance of human-centered approaches to researching LLM security,” Inie explained. “A year or two after the launch of ChatGPT, hundreds of papers were published on arXiv eager to demonstrate the effectiveness of a single jailbreak (an approach to breaking through safeguards of an LLM) and it was impossible for security professionals to keep on top of all of them.”

“We basically just asked the people who are good at this and collected all their techniques and rationales in a comprehensive overview of LLM red teaming. This issue has to be addressed as a community, which means taking heed of a wide variety of human behaviors and intuitions. Traditional cybersecurity experts had very little advantage in this terra nova of generative machine learning, making it even more crucial to go beyond this sibling community.”

Between December 2022 and January 2023, the researchers conducted in-depth interviews with 28 individuals who actively participated in attempts to manipulate LLMs. These participants came from a wide range of backgrounds, including software engineers, researchers, artists, and even someone who worked on a cannabis farm. Many had jobs in machine learning or cybersecurity, while others were hobbyists or creative explorers. The interviews were conducted via video call, recorded, and later transcribed and analyzed using grounded theory—a method for developing conceptual frameworks based on qualitative data.

The researchers examined how participants defined their own activities, what strategies they used to interact with models, and what motivated their efforts. From these insights, they built a detailed theoretical model of LLM red teaming.

The study defined LLM red teaming as a manual, non-malicious process where individuals explore the boundaries of AI systems by trying to provoke unexpected or restricted responses. The activity typically involved a mix of technical skill, creative experimentation, and playful curiosity. While some participants used terms like “prompt engineering” or “hacking,” many described their work in more whimsical terms—like “alchemy,” “magic,” or “scrying.”

“Why are engineers and scientists so interested in magic and demons?” Inie wondered. “It’s such a consistent way of describing the gaps in sensemaking. This was fascinating and very lovely; the more senior the interviewee, the higher the chance of the arcane creeping in to their description. Why? This is something that practitioners will benefit from being formalized and understood, so that we can be confident in our sensemaking around LLMs, this complex set of technologies that we still don’t understand.”

Several core features emerged as consistent across participants:

  1. Limit-seeking behavior: Participants were not trying to use LLMs as intended. Instead, they deliberately tried to provoke the models into saying things their developers likely wanted to avoid—ranging from offensive jokes to instructions for fictional crimes. These acts were not committed out of malice, but to test the boundaries of what the models could and couldn’t be coaxed into doing.
  2. Non-malicious intent: None of the interviewees expressed any desire to harm others or exploit systems for personal gain. Most were driven by a mix of ethical concerns, intellectual curiosity, and personal interest. Many saw their work as a public good—helping developers identify vulnerabilities before malicious actors could exploit them.
  3. Manual and experimental process: Unlike automated attacks, red teaming was described as a hands-on, intuitive process. Participants described the activity as exploratory and improvisational—akin to trial-and-error tinkering with a new toy. Some compared their process to a “trance state,” losing hours trying different prompts just to see what might work.
  4. Community and collaboration: Red teaming was rarely a solo endeavor. Participants described a loose but vibrant online community, primarily based on platforms like Twitter, Reddit, and Discord. They shared prompts, discussed tactics, and built on each other’s discoveries. Even when not part of formal teams, they viewed their efforts as a collective contribution to understanding AI.
  5. An alchemist mindset: Many participants described their work in metaphoric or mystical terms, acknowledging that they often didn’t fully understand why a certain prompt worked. This embrace of uncertainty gave rise to creative problem-solving, as participants experimented with different languages, formats, or even fictional scenarios to bypass model safeguards.

The researchers also identified a taxonomy of 12 distinct strategies and 35 specific techniques used by participants. These were grouped into five broad categories: language manipulation, rhetorical framing, world-building, fictionalization, and stratagems.

Language strategies involved using alternative formats like code or stop sequences to bypass restrictions. Rhetorical approaches relied on persuasion, misdirection, and escalating requests. World-building techniques placed the model in imagined scenarios where different rules or ethics applied, while fictionalization reframed prompts through genre or roleplay to elicit sensitive content. Stratagems, such as prompt regeneration, meta-prompting, or adjusting temperature settings, exploited the model’s underlying mechanics to increase the chances of a successful jailbreak.
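As a rough illustration of how such a taxonomy might be put to work, the sketch below (not taken from the paper) groups the five categories in a small Python structure and automates one of the listed stratagems: regenerating the same prompt at different temperature settings. The `generate` function, the example prompt, and the specific temperature values are hypothetical placeholders rather than any particular vendor's API or the study's own tooling.

```python
# Minimal, illustrative sketch of the five categories and one "stratagem".
# Everything here is a placeholder; swap `generate` for a real LLM client.

TAXONOMY = {
    "language manipulation": ["alternative formats such as code", "stop sequences"],
    "rhetorical framing": ["persuasion", "misdirection", "escalating requests"],
    "world-building": ["imagined scenarios where different rules or ethics apply"],
    "fictionalization": ["genre framing", "roleplay"],
    "stratagems": ["prompt regeneration", "meta-prompting", "temperature adjustment"],
}


def generate(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a call to the model being probed."""
    return f"[model output for {prompt!r} at temperature={temperature}]"


def regenerate(prompt: str, temperatures=(0.2, 0.7, 1.2), attempts=3):
    """Resample one prompt across sampling temperatures and collect the outputs,
    mirroring the manual trial-and-error the participants describe."""
    return [
        (t, generate(prompt, temperature=t))
        for t in temperatures
        for _ in range(attempts)
    ]


if __name__ == "__main__":
    for temperature, output in regenerate("a benign test prompt"):
        print(temperature, output)
```

In practice, the study's participants carried out this kind of resampling by hand rather than in scripts, which is precisely why the authors characterize red teaming as a manual, exploratory activity.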

“All LLMs are hackable by anyone with a computer and a decent command of written language,” Inie told PsyPost. “This study displays the incredible breadth of potential security issues that the implementation of an LLM in a system introduces. Cybersecurity in the context of LLMs no longer depends on scanning IP addresses and crunching passwords, but is much closer to social engineering — only, we can now use social engineering techniques directly on the computer.”

Inie was surprised by “how much we could learn about a technologically advanced security issue by asking people, and by asking a varied sample of people. Every interview taught us something new, every person lent a fantastic perspective and demonstrated that closing security holes potentially means introducing new safety risks; such as the prompt engineer who was worried about the model providers actually fixing hallucinations in models because ‘if you make hallucinations rare enough, people become unfamiliar with what they look like and they stop looking for them’ (P14) — or the researcher who noted that sharing different hacks and jailbreaks creates an economy of relevance where outrage potentially directs effort: ‘there’s a certain amount of squabbling. Should we be thinking about murder bots, or should we be thinking about racist bots? And there seems to have been a cultural divide that has appeared around this, but it’s kind of a silly divide because we don’t know how to solve either problem’ (P18).”

One limitation of the study is that it captures a specific moment in time—late 2022 to early 2023—when LLMs were still relatively new to the public and rapidly evolving. Some of the specific attack strategies shared by participants have already been patched or made obsolete by updated models.

However, the researchers argue that the broader insights remain relevant. By focusing on motivations, behaviors, and general strategies, the study offers a framework that can adapt to future changes in technology. Understanding the human element—why and how people probe AI—is essential for designing more resilient and ethical systems.

“Specific attack wording is unlikely to transfer between individual models, and the state of the art is always progressing in a way that tends to render older attacks obsolete,” Inie noted. “And that’s OK. That’s why we focused on building a grounded theory of generalized strategies and techniques. While individual prompts encountered in the study might not work with tomorrow’s LLMs, the general theory has held well over the time between doing the work and having it published.”

“This is why a human-centered interview study makes more sense than looking at individual attacks on a surface level — humans can tell us about their underlying strategies and rationales, and these typically transfer much better than individual attacks.”

The researchers emphasize that their work fills a significant gap in the field by offering a structured, evidence-based understanding of how people engage with LLMs in adversarial ways. While much of the conversation around AI security focuses on technical benchmarks and automated defenses, this study highlights the need to first understand the human behaviors and motivations behind these interactions.

“We set out to understand how this novel human activity works,” Inie explained. “Long term we want to accelerate sensemaking in this area. Industry and academia have both struggled with building typologies of LLM attacks, because there hasn’t been enough evidence on the ground to construct them. Working out what kinds of attacks to try and how to execute or even automate them is going to be something people in the field will do constantly in coming years, and our work will make this faster and more consistent.”

“We also want to showcase the massive impact of qualitative work in machine learning and security. There’s often a focus on measuring effect and efficiency, but this is useless until you know what to measure. Qualitative research shows what can be measured – it’s the requisite step before quantitative anything. Without a theory describing a phenomenon, everyone is working in the dark.

“Often, when developing a new way of showing data, a handful of engineers guess how something works and build that feature, and everyone downstream of their work ends up subjected to that framing,” Inie added. “Those engineers are in fact doing qualitative work, but often without a formal methodology. Studies like ours use strong and broad evidence to show exactly how to go about assessing this new activity using familiar quantitative tools, and we do it in a way that reflects human behaviour and expectations. We hope this gives a good example of how crucial and feasible rigorous qualitative work is in strongly digital, highly engineering disciplines.”

The study, “Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming,” was authored by Nanna Inie, Jonathan Stray, and Leon Derczynski.
