PsyPost

AI vision: GPT-4V shows human-like ability to interpret social scenes, study finds

by Eric W. Dolan
September 5, 2025
in Artificial Intelligence
[Adobe Stock]

A new study published in Imaging Neuroscience has found that large language models with visual processing abilities, such as GPT-4V, can evaluate and describe social interactions in images and short videos in a way that closely matches human perception. The research suggests that artificial intelligence can not only identify individual social cues, but also capture the underlying structure of how humans perceive social information.

Large language models (LLMs) are advanced machine learning systems that can generate human-like responses to text inputs. Over the past few years, LLMs have become capable of passing professional exams, emulating personality traits, and simulating theory of mind. More recently, models such as GPT-4V have gained the ability to process visual inputs, making it possible for them to “see” and describe scenes, objects, and people.

This leap in visual capability opens new possibilities for psychological research. Human social perception depends heavily on our ability to make quick inferences from visual input—interpreting facial expressions, body posture, and interactions between people.

If AI models can match or approximate these human judgments, they may offer scalable tools for behavioral science and cognitive neuroscience. But the key question remains: How well can AI interpret the nuanced, often ambiguous social signals that humans rely on?

To explore this question, researchers at the University of Turku used OpenAI’s GPT-4V to evaluate a set of 468 static images and 234 short video clips, all depicting scenes with rich social content drawn from Hollywood films. The goal was to see whether GPT-4V could detect the presence of 138 different social features—ranging from concrete behaviors like “laughing” or “touching someone” to abstract traits like “dominant” or “empathetic.”

These same images and videos had previously been annotated by a large group of human participants. In total, over 2,200 individuals contributed more than 980,000 perceptual judgments using a sliding scale from “not at all” to “very much” to rate each feature. The human evaluations were used as a reference point to assess how closely GPT-4V’s ratings aligned with the consensus of real observers.

For each image or video, the researchers prompted GPT-4V to generate numerical ratings for the full set of social features. They repeated this process five times to account for the model’s variability, then averaged the results. In the case of video clips, since GPT-4V cannot yet directly process motion, the researchers extracted eight representative frames and added the transcribed dialogue from the clip.
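The repeat-and-average procedure described above can be sketched as follows. This is a minimal illustration, not the study's code: `rate_stimulus` is a hypothetical stand-in for a real GPT-4V API call, and the feature names are examples drawn from the article.

```python
import statistics

def rate_stimulus(stimulus, features, seed):
    # Hypothetical placeholder for a GPT-4V call that returns one
    # numeric rating per social feature; here we fabricate a
    # deterministic value in [0, 1] so the sketch runs standalone.
    return {f: (hash((stimulus, f, seed)) % 101) / 100 for f in features}

def averaged_ratings(stimulus, features, n_repeats=5):
    """Query the model n_repeats times and average each feature's rating,
    mirroring the study's strategy for smoothing out model variability."""
    runs = [rate_stimulus(stimulus, features, seed=i) for i in range(n_repeats)]
    return {f: statistics.mean(run[f] for run in runs) for f in features}

features = ["laughing", "touching someone", "dominant", "empathetic"]
ratings = averaged_ratings("clip_001", features)
```

Averaging repeated queries is a common way to reduce the run-to-run noise of a stochastic model before comparing it against human data.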

The results showed a high level of agreement between GPT-4V and human observers. The correlation between AI and human ratings was 0.79 for both images and videos—a level that approaches the reliability seen between individual human participants. In fact, GPT-4V outperformed single human raters for 95% of the social features in images and 85% in videos.
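Agreement of this kind is typically quantified as a Pearson correlation across stimuli between the model's ratings and the human ratings. A minimal sketch with made-up numbers (not the study's data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy ratings of one feature across five stimuli (illustrative only).
model = [0.1, 0.4, 0.5, 0.8, 0.9]
human = [0.2, 0.3, 0.6, 0.7, 0.95]
r = pearson_r(model, human)
```

A value of r = 0.79, as reported in the study, indicates that stimuli the model rates high tend to be the same ones humans rate high, without the two sets of ratings needing to match in absolute magnitude.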

However, GPT-4V’s ratings did not always match group-level consensus. When compared to the average of five human raters, the AI’s agreement was slightly lower, particularly for video clips. This suggests that while GPT-4V provides a strong approximation of human perception, its reliability may not yet match the collective judgment of multiple human observers working together.

The study also examined whether GPT-4V captured the deeper structure of how humans organize social information. Using statistical techniques such as principal coordinate analysis, the researchers found that the dimensions GPT-4V used to represent the social world—such as dominant vs. empathetic or playful vs. sexual—were strikingly similar to those found in human data.
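Principal coordinate analysis (also known as classical multidimensional scaling) embeds items in a low-dimensional space so that their pairwise distances are preserved as well as possible, revealing the main axes along which judgments vary. A minimal NumPy sketch on a toy distance matrix; the four items and their distances are illustrative, not taken from the study:

```python
import numpy as np

def pcoa(D, n_dims=2):
    """Classical MDS: eigendecompose the double-centered squared-distance
    matrix and return low-dimensional coordinates for each item."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the embedding
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    vals = np.clip(vals, 0, None)            # drop small negative noise
    return vecs[:, :n_dims] * np.sqrt(vals[:n_dims])

# Toy dissimilarities among four social features: the first pair and the
# second pair are close within pairs but far between pairs.
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [0.8, 0.9, 0.3, 0.0]])
coords = pcoa(D)
```

Running the same analysis on human and model rating matrices, and comparing the resulting axes, is one way to test whether the two share an underlying representational structure.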

This suggests that the model is not only mimicking surface-level judgments but may be tapping into similar patterns of representation that humans use to make sense of social interactions.

To take the comparison one step further, the researchers used GPT-4V’s social feature annotations as predictors in a functional MRI (fMRI) study. Ninety-seven participants had previously watched a medley of 96 short, socially rich video clips while undergoing brain scans. By linking the social features present in each video to patterns of brain activity, the researchers could map which areas of the brain respond to which types of social information.

Remarkably, GPT-4V-based stimulus models produced brain activation maps nearly identical to those generated from human annotations. The correlation between the two sets of maps was extremely high (r = 0.95), and both identified a similar network of regions—such as the superior temporal sulcus, temporoparietal junction, and fusiform gyrus—as being involved in processing social cues.
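The comparison amounts to fitting the same linear encoding model twice, once with human annotations and once with model annotations as predictors, then correlating the resulting per-voxel weight maps. A schematic NumPy sketch with simulated data; all dimensions, noise levels, and variable names are illustrative assumptions, not the study's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, n_features, n_voxels = 96, 10, 500

# Human feature annotations, plus a noisy "AI" copy standing in for GPT-4V.
X_human = rng.normal(size=(n_clips, n_features))
X_ai = X_human + 0.1 * rng.normal(size=(n_clips, n_features))

# Simulated voxel responses driven by the human-annotated features.
true_w = rng.normal(size=(n_features, n_voxels))
Y = X_human @ true_w + 0.5 * rng.normal(size=(n_clips, n_voxels))

def fit_betas(X, Y):
    """Ordinary least squares weights (features x voxels) for each voxel."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

b_human = fit_betas(X_human, Y)
b_ai = fit_betas(X_ai, Y)

# Voxel-wise similarity of the two weight maps, flattened to vectors.
r = np.corrcoef(b_human.ravel(), b_ai.ravel())[0, 1]
```

If the two annotation sources carry the same information about each clip, the two fitted weight maps end up highly correlated, which is the pattern the study reported (r = 0.95).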

This finding provides evidence that GPT-4V’s judgments can be used to model how the brain perceives and organizes social information. It also suggests that AI models could assist in designing and interpreting future neuroimaging experiments, especially in cases where manual annotation would be time-consuming or expensive.

These findings open several possible directions for future research and real-world applications. In neuroscience, LLMs like GPT-4V could help generate high-dimensional annotations of complex stimuli, allowing researchers to reanalyze existing brain data or design new experiments with greater precision. In behavioral science, AI could serve as a scalable tool for labeling emotional and social content in large datasets.

Outside the lab, this technology could support mental health care by identifying signs of distress in patient interactions, or improve customer service by analyzing emotional cues in video calls. It could also be used in surveillance systems to detect potential conflicts or identify unusual social behaviors in real-time settings.

At the same time, the study’s authors caution that these models are not perfect replacements for human judgment. GPT-4V performed worse on some social features that involve more subjective or ambiguous judgments, such as “ignoring someone” or “harassing someone.” These types of evaluations may require contextual understanding that AI systems still lack, or may be influenced by training data biases or content moderation filters.

The model also tended to rate low-level features more conservatively than humans—possibly due to its probabilistic nature or its safeguards against generating controversial outputs. In some cases, the AI refused to evaluate scenes containing sexual or violent content, highlighting the constraints imposed by platform-level safety policies.

While the results are promising, some limitations should be noted. The AI ratings were compared against a relatively small number of human raters per stimulus, and larger datasets could provide a more robust benchmark. The model was also tested on short, scripted film clips rather than real-world or live interactions, so its performance in more natural settings remains an open question.

Future work could explore whether tailoring LLMs to specific demographic perspectives improves their alignment with particular groups. Researchers might also investigate how AI models form these judgments—what internal processes or representations they use—and whether these resemble the mechanisms underlying human social cognition.

The study, “GPT-4V shows human-like social perceptual capabilities at phenomenological and neural levels,” was authored by Severi Santavirta, Yuhang Wu, Lauri Suominen, and Lauri Nummenmaa.
