Scientists tested the creativity of AI models, and the results were surprisingly homogeneous

A recent study published in PNAS Nexus suggests that while artificial intelligence chatbots can match or exceed human creativity on individual tasks, they produce highly similar responses when compared to one another. This provides evidence that widespread reliance on artificial intelligence for creative tasks could lead to a loss of unique ideas.

Scientists Emily Wenger and Yoed N. Kenett designed this study to understand how large language models affect the diversity of human thought. Large language models, the technology behind popular AI chatbots, predict and generate text based on user prompts.

Large language models are complex computer programs designed to process and produce human language. Developers build these systems by training them on billions of sentences from books, articles, and websites. By analyzing this vast amount of text, the models learn the mathematical patterns and relationships between words.

When a user gives a chatbot a prompt, the model works by calculating the most probable next word in a sequence. It builds responses one word at a time based on the rules and associations it learned during its training phase. Wenger suspected this shared training method across different systems might cause a broader issue.
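The word-by-word process described above can be illustrated with a deliberately tiny stand-in. This is a minimal sketch, assuming a simple word-pair counting model and an invented three-sentence corpus; real large language models use neural networks rather than count tables, and the function names here are hypothetical. It also shows why shared training data matters: two models trained on the same text produce the same output.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def generate(model, start, max_words=5):
    """Build a response one word at a time, always picking the most frequent follower."""
    output = [start]
    for _ in range(max_words - 1):
        followers = model.get(output[-1])
        if not followers:
            break  # the last word never had a follower in the training text
        output.append(followers.most_common(1)[0][0])
    return " ".join(output)


# Invented toy corpus, for illustration only.
CORPUS = [
    "the fork lifts food",
    "the fork lifts noodles",
    "the fork lifts food",
]

# Two "different" models trained on the same data learn the same patterns.
model_a = train_bigram_model(CORPUS)
model_b = train_bigram_model(CORPUS)
text_a = generate(model_a, "the")
text_b = generate(model_b, "the")
print(text_a)
print(text_a == text_b)
```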

“Most of today’s LLMs are trained on massive datasets of scraped internet data, which functionally means they’re all trained on roughly the same data,” said Wenger, Cue Family Assistant Professor at Duke University. “Traditional machine learning research has shown that training models on the same dataset leads to models with similar properties. I was wondering if this phenomenon would occur in commercial LLMs and what the implications might be.”

To investigate this, the researchers recruited 102 human participants through Prolific, an online platform for survey research. They screened the human participants to remove computer bots and ensure everyone passed basic attention checks. They also selected 22 different language models from various companies, including well-known chatbots produced by Google, Meta, and OpenAI.

Both humans and language models completed three standard verbal creativity tasks. The first was the Alternative Uses Task, which asks participants to list as many creative uses as possible for everyday objects like a fork, a book, or a pair of pants. This assessment tests divergent thinking, which is the ability to generate multiple unique solutions to a single problem.

The second assessment was the Forward Flow task, which measures associative thinking. Participants receive a starting word, like “snow” or “candle,” and must provide a chain of up to 20 subsequent words that naturally follow in their minds. Associative thinking helps individuals search through their memories and combine different concepts into new ideas.


The final assessment was the Divergent Association Task. This exercise required participants to generate 10 nouns that are as unrelated to one another as possible. Generating unrelated words demonstrates a cognitive flexibility that is strongly linked to creative abilities in humans.

The scientists then used computational text-analysis tools to evaluate the responses. These tools embed words into a mathematical space and measure the semantic distance between them, a number that captures how different words and concepts are from one another. The researchers measured both the individual originality of a single answer and the overall variability among all answers in a group.
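To make the idea of semantic distance concrete, here is a minimal sketch, assuming cosine distance and hand-made three-dimensional vectors. The article does not say which embedding model or distance formula the researchers used, so the vectors, the word list, and the function names are all illustrative assumptions. Real embeddings have hundreds of dimensions and are learned from text.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average the distance over every pair of vectors in a group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy "embeddings", invented for illustration only.
TOY_EMBEDDINGS = {
    "fork": [0.9, 0.1, 0.0],
    "spoon": [0.8, 0.2, 0.0],
    "comet": [0.0, 0.1, 0.9],
    "justice": [0.1, 0.9, 0.1],
}

related = mean_pairwise_distance(
    [TOY_EMBEDDINGS[w] for w in ("fork", "spoon")]
)
unrelated = mean_pairwise_distance(
    [TOY_EMBEDDINGS[w] for w in ("fork", "comet", "justice")]
)
print(f"related words:   {related:.3f}")
print(f"unrelated words: {unrelated:.3f}")
```

The same averaging can be applied at two levels: across the words within one answer, as a rough stand-in for originality, or across the answers of many respondents, as a rough stand-in for group variability. A low group average means the answers cluster together.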

The researchers found that individual language models performed at or slightly above the level of the average human on most of the tasks. When looking at a single response in isolation, the chatbots provided highly original answers. However, a pattern of similarity emerged when the scientists compared all the responses from the different models to one another.

Across all tasks, the models produced answers that were significantly more alike than the answers provided by humans. The chatbots frequently relied on the same overlapping vocabulary, causing their creative outputs to group together in a highly uniform way. This similarity was even more pronounced when the researchers compared models built by the same company.

“My hypothesis was that there would be some degree of homogeneity among LLM responses relative to humans, but I was surprised by the degree,” Wenger said.

Wenger and Kenett also tested whether they could force the models to be more diverse. They adjusted the “temperature” setting on the models, which is a mechanism that controls the level of randomness in the text generation process. Low temperatures produce highly predictable text, while high temperatures introduce more random word choices.
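The effect of temperature can be sketched with the standard softmax-with-temperature formula, in which each candidate word's score is divided by the temperature before being turned into a probability. This is a minimal illustration, not the configuration used in the study: the candidate words, their scores, and the function names are invented assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_word(words, logits, temperature, rng):
    """Draw one word at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]


# Invented candidate words and scores for a prompt such as "Use a fork to ...".
WORDS = ["eat", "comb", "dig", "conduct"]
LOGITS = [3.0, 1.5, 1.0, 0.2]

low = softmax_with_temperature(LOGITS, 0.2)   # nearly all weight on "eat"
high = softmax_with_temperature(LOGITS, 5.0)  # weight spread across all words

for word, p_low, p_high in zip(WORDS, low, high):
    print(f"{word:8s} low T: {p_low:.3f}   high T: {p_high:.3f}")

rng = random.Random(0)  # seeded so repeated runs draw the same word
print(sample_next_word(WORDS, LOGITS, 5.0, rng))
```

At a low temperature the top-scoring word wins almost every time. At a high temperature the probabilities flatten, so unlikely words are drawn far more often.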

While increasing the randomness did make the responses more varied, it quickly caused the models to produce nonsensical gibberish. These random responses no longer fulfilled the basic requirements of the creative prompts. True creativity requires an idea to be both novel and appropriate for the situation, so generating gibberish does not count as a successful creative output.

The researchers also tried changing the initial instructions given to the models. They explicitly commanded the chatbots to act as creative assistants and provide bold, outside-the-box answers. This slightly improved individual originality but completely failed to fix the broader issue of uniformity, as the models still produced responses similar to one another.

These findings suggest that relying on generative AI for brainstorming or problem-solving could limit the scope of human creativity. If everyone uses these tools to help write drafts or generate ideas, society might see a massive narrowing of concepts.

“If you are using AI chatbots (which are built on LLMs) for creative tasks, know that the results you get from these models will likely look very similar to the results someone else would get from an AI chatbot, even if it’s different from the one you used. If you want your content to be truly unique, probably shy away from using an AI chatbot to generate it.”

The researchers note some potential misinterpretations and limitations of their work. The study only measured performance on specific verbal creativity tasks, which means the results might not apply to all forms of creative behavior. For example, language models might not show the same homogenization when asked to perform generic non-verbal tasks like drawing or composing music.

Additionally, the scientists only tested commercially available models that have been programmed to follow strict safety and conversational guidelines. This safety training is known to affect how models behave in experimental settings. It is possible that raw, unaligned models might display different creative properties, though most everyday users do not have access to these raw versions.

Future research will need to explore other dimensions of creativity, such as fluency and flexibility, rather than just originality. Fluency refers to the sheer number of ideas generated, while flexibility refers to the variety of categories those ideas cover. The scientists also hope to investigate the extent of this homogenization across other types of artificial intelligence and explore potential engineering solutions to mitigate the problem.
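The two definitions above can be expressed as simple counts. This is a minimal sketch, assuming each idea has already been assigned to a category by hand; the ideas, category labels, and function names are invented for illustration and do not come from the study.

```python
def fluency(ideas):
    """Fluency: the sheer number of ideas generated."""
    return len(ideas)


def flexibility(ideas, category_of):
    """Flexibility: the number of distinct categories the ideas cover."""
    return len({category_of[idea] for idea in ideas})


# Hypothetical alternative uses for a fork, with hand-assigned categories.
CATEGORY_OF = {
    "eat pasta": "eating",
    "eat salad": "eating",
    "comb hair": "grooming",
    "dig in a plant pot": "gardening",
}
ideas = list(CATEGORY_OF)

print(fluency(ideas))
print(flexibility(ideas, CATEGORY_OF))
```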

The study, “Large language models are homogeneously creative,” was authored by Emily Wenger and Yoed N. Kenett.
