“Intersectional hallucinations”: The AI flaw that could lead to dangerous misinformation

by Ericka Johnson
August 27, 2024
in Artificial Intelligence
(Photo credit: DALL·E)


When you go to the hospital and get a blood test, the results are added to a dataset alongside other patients’ results and population data. This lets doctors compare you (your blood, age, sex, health history, scans and so on) to other patients’ results and histories, helping them predict outcomes, manage conditions and develop new treatments.

For centuries, this has been the bedrock of scientific research: identify a problem, gather data, look for patterns, and build a model to solve it. The hope is that Artificial Intelligence (AI) – the kind called Machine Learning that makes models from data – will be able to do this far more quickly, effectively and accurately than humans.

However, training these AI models needs a LOT of data, so much that some of it has to be synthetic – not real data from real people, but data that reproduces existing patterns. Most synthetic datasets are themselves generated by Machine Learning AI.

Wild inaccuracies from image generators and chatbots are easy to spot, but synthetic data also produces hallucinations – results that are unlikely, biased, or plain impossible. As with images and text, they can be amusing, but the widespread use of these systems in all areas of public life means that the potential for harm is massive.

What is synthetic data?

AI models need much more data than the real world can offer. Synthetic data provides a solution – generative AI examines the statistical distributions in a real dataset and creates a new, synthetic one to train other AI models.
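
To make this concrete, here is a minimal sketch (in Python, with made-up column names and data, not anything from the study) of the simplest possible approach: resampling each column from its empirical distribution in a real table. Real generators such as GANs or copula models try to capture the joint distribution rather than each column in isolation, which is exactly where the intersection problems discussed below come from.

```python
# Minimal sketch: build a "synthetic" table by resampling each column from
# its empirical distribution in a (hypothetical) real dataset.
# Column names and values are illustrative, not from the study.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

real = pd.DataFrame({
    "age": rng.integers(18, 90, size=1_000),
    "sex": rng.choice(["F", "M"], size=1_000),
    "income": rng.choice(["<=50K", ">50K"], size=1_000, p=[0.75, 0.25]),
})

def sample_marginals(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Resample each column independently from its observed values."""
    return pd.DataFrame({
        col: rng.choice(df[col].to_numpy(), size=n, replace=True)
        for col in df.columns
    })

synthetic = sample_marginals(real, n=2_000)
```

Because each column is sampled independently here, any relationship between columns is lost; a production-grade generator has to model those relationships as well, and that is where hallucinations can creep in.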

This synthetic ‘pseudo’ data is similar but not identical to the original, meaning it can also ensure privacy, skirt data regulations, and be freely shared or distributed.

Synthetic data can also supplement real datasets, making them big enough to train an AI system. Or, if a real dataset is biased (has too few women, for example, or over-represents cardigans instead of pullovers), synthetic data can balance it out. There is ongoing debate around how far synthetic data can stray from the original.
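
As a rough illustration of that balancing idea (again a sketch with hypothetical names, not the method used in the research), one could draw extra synthetic rows for each under-represented group until the groups are the same size:

```python
# Minimal sketch: top up under-represented groups with extra synthetic rows.
# `synthetic_pool` stands in for the output of any generator; here we simply
# resample matching rows from it.
import pandas as pd

def balance_group(real: pd.DataFrame, synthetic_pool: pd.DataFrame,
                  column: str) -> pd.DataFrame:
    counts = real[column].value_counts()
    target = counts.max()
    extras = []
    for value, count in counts.items():
        missing = target - count
        candidates = synthetic_pool[synthetic_pool[column] == value]
        if missing > 0 and not candidates.empty:
            extras.append(candidates.sample(missing, replace=True, random_state=0))
    return pd.concat([real, *extras], ignore_index=True)

# e.g. balance_group(real, synthetic, "sex")
```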

Glaring omissions

Without proper curation, the tools that make synthetic data will always over-represent things that are already dominant in a dataset and under-represent (or even omit) less common ‘edge-cases’.

This was what initially sparked my interest in synthetic data. Medical research already under-represents women and other minorities, and I was concerned that synthetic data would exacerbate this problem. So, I teamed up with a machine learning scientist, Dr Saghi Hajisharif, to explore the phenomenon of disappearing edge-cases.

In our research, we used a type of AI called a GAN to create synthetic versions of 1990 US adult census data. As expected, edge-cases were missing in the synthetic datasets. In the original data we had 40 countries of origin, but in a synthetic version, there were only 31 – the synthetic data left out immigrants from 9 countries.
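
A simple check for this kind of disappearance (a sketch, assuming the real and synthetic tables are pandas DataFrames, with an illustrative column name) is to compare which categories survive in the synthetic version:

```python
# Minimal sketch: find categorical values that exist in the real data but
# never appear in the synthetic version (the disappearing edge-cases).
import pandas as pd

def missing_categories(real: pd.DataFrame, synthetic: pd.DataFrame,
                       column: str) -> set:
    return (set(real[column].dropna().unique())
            - set(synthetic[column].dropna().unique()))

# e.g. missing_categories(real, synthetic, "native_country") would list the
# countries of origin the generator left out.
```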

Once we knew about this error, we were able to tweak our methods and include them in a new synthetic dataset. It was possible, but only with careful curation.

‘Intersectional hallucinations’ – AI creates impossible data

We then started noticing something else in the data – intersectional hallucinations.

Intersectionality is a concept in gender studies. It describes power dynamics that produce discrimination and privilege for different people in different ways. It looks not just at gender, but also at age, race, class, disability, and so on, and how these elements ‘intersect’ in any situation.

This can inform how we analyse synthetic data – all data, not just population data – as the intersecting aspects of a dataset produce complex combinations of whatever that data is describing.

In our synthetic dataset, the statistical representation of separate categories was quite good. Age distribution, for example, was similar in the synthetic data to the original. Not identical, but close. This is good because synthetic data should be similar to the original, not reproduce it exactly.

Then we analysed our synthetic data for intersections. Some of the more complex intersections were being reproduced, too. For example, in our synthetic dataset, the intersection of age-income-gender was reproduced quite accurately. We called this accuracy ‘intersectional fidelity’.
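
One way to put a number on that fidelity (a sketch only, assuming both tables are pandas DataFrames; the column names are illustrative) is to compare the joint frequencies of an intersection between the real and synthetic data:

```python
# Minimal sketch: measure how far apart the joint distribution of an
# intersection (e.g. age-income-gender) is in the real vs synthetic data.
import pandas as pd

def joint_frequency_gap(real: pd.DataFrame, synthetic: pd.DataFrame,
                        columns: list[str]) -> float:
    p_real = real.groupby(columns).size() / len(real)
    p_synth = synthetic.groupby(columns).size() / len(synthetic)
    # Total variation distance: 0 means identical joint frequencies,
    # 1 means completely disjoint.
    return 0.5 * p_real.sub(p_synth, fill_value=0).abs().sum()

# e.g. joint_frequency_gap(real, synthetic, ["age", "income", "sex"])
```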

But we also noticed the synthetic data had 333 datapoints labelled “husband/wife and single” – an intersectional hallucination. The AI had not learned (or been told) that this is impossible. Of these, over 100 datapoints were “never-married-husbands earning under 50,000 USD a year”, an intersectional hallucination that did not exist in the original data.

On the other hand, the original data included multiple “widowed females working in tech support”, but they were completely absent from the synthetic version.
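
Both problems – combinations that appear only in the synthetic data (hallucinations) and combinations that exist in the original but vanish from the synthetic version – can be surfaced with a simple set comparison. The sketch below assumes pandas DataFrames with illustrative column names:

```python
# Minimal sketch: list intersections present only in the synthetic data
# (hallucinations) and only in the real data (vanished edge-cases).
import pandas as pd

def intersection_diff(real: pd.DataFrame, synthetic: pd.DataFrame,
                      columns: list[str]) -> tuple[set, set]:
    real_combos = set(map(tuple, real[columns].drop_duplicates().to_numpy()))
    synth_combos = set(map(tuple, synthetic[columns].drop_duplicates().to_numpy()))
    hallucinated = synth_combos - real_combos
    vanished = real_combos - synth_combos
    return hallucinated, vanished

# e.g. intersection_diff(real, synthetic, ["marital_status", "relationship"])
# might surface combinations like ("Never-married", "Husband") that exist
# only in the synthetic table.
```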

This means that our synthetic dataset could be used for research on age-income-gender questions (where there was intersectional fidelity) but not if one were interested in “widowed females working in tech support”. And one should watch out for “never-married-husbands” in the results.

The big question is: where does this stop? These hallucinations are 2-part and 3-part intersections, but what about 4-part intersections? Or 5-part? At what point (and for what purposes) would the synthetic data become irrelevant, misleading, useless or dangerous?
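
One way to probe that question empirically (a sketch only, reusing the set comparison above on hypothetical tables) is to count hallucinated combinations at every intersection depth and watch how quickly they accumulate:

```python
# Minimal sketch: count combinations that exist only in the synthetic data,
# for every intersection size k from 2 up to max_k.
from itertools import combinations
import pandas as pd

def hallucinations_by_depth(real: pd.DataFrame, synthetic: pd.DataFrame,
                            max_k: int = 5) -> dict[int, int]:
    counts = {}
    cols = list(real.columns)
    for k in range(2, max_k + 1):
        total = 0
        for subset in combinations(cols, k):
            subset = list(subset)
            real_combos = set(map(tuple, real[subset].drop_duplicates().to_numpy()))
            synth_combos = set(map(tuple, synthetic[subset].drop_duplicates().to_numpy()))
            total += len(synth_combos - real_combos)
        counts[k] = total
    return counts
```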

Embracing intersectional hallucinations

Structured datasets exist because the relationships between the columns on a spreadsheet tell us something useful. Remember the blood test. Doctors want to know how your blood compares to normal blood, and to other diseases and treatment outcomes. That is why we organise data in the first place, and have done for centuries.

However, when we use synthetic data, intersectional hallucinations are always going to happen because the synthetic data must be slightly different to the original, otherwise it would simply be a copy of the original data. Synthetic data therefore requires hallucinations, but only the right kind – ones that amplify or expand the dataset, but do not create something impossible, misleading or biased.

The existence of intersectional hallucinations means that one synthetic dataset cannot work for lots of different uses. Each use-case will need bespoke synthetic datasets with labelled hallucinations, and this needs a recognised system.

Building reliable AI systems

For AI to be trustworthy, we have to know which intersectional hallucinations exist in its training data, especially when it is used to predict how people will act, or to regulate, govern, treat or police us. We need to ensure these systems are not trained on dangerous or misleading intersectional hallucinations – like a 6-year-old medical doctor receiving pension payments.

But what happens when synthetic datasets are used carelessly? Right now there is no standard way to mark them, and they are often mixed up with real data. When a dataset is shared for others to use, it is impossible to know if it can be trusted, and to know what is a hallucination and what is not. We need clear, universally recognisable ways to identify synthetic data.
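
There is no agreed standard for this yet, but as a sketch of what such marking could look like (the field names here are an assumption, not an existing convention), a shared table could carry a row-level flag plus machine-readable provenance metadata:

```python
# Minimal sketch: one possible way to flag synthetic rows and attach
# provenance before sharing a dataset. Field names are illustrative,
# not a recognised standard.
import json
import pandas as pd

def tag_synthetic(df: pd.DataFrame, generator: str, source_dataset: str) -> pd.DataFrame:
    tagged = df.copy()
    tagged["is_synthetic"] = True            # row-level flag
    tagged.attrs["provenance"] = {           # table-level metadata (pandas .attrs)
        "generator": generator,
        "derived_from": source_dataset,
        "known_hallucinations": [],          # to be filled in during curation
    }
    return tagged

# Writing the metadata out next to the data helps it survive file exports:
# json.dump(tagged.attrs["provenance"], open("provenance.json", "w"))
```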

Intersectional hallucinations may not be as amusing as a hand with 15 fingers, or recommendations to put glue on a pizza. They are boring, unsexy numbers and statistics, but they will affect us all – sooner or later, synthetic data is going to spread everywhere, and it will always, by its very nature, contain intersectional hallucinations. Some we want, some we don’t, but the problem is telling them apart. We need to make this possible before it is too late.

 

This article is republished from The Conversation under a Creative Commons license. Read the original article.
