PsyPost

AI learns language through the experience of a single child in groundbreaking study

by Eric W. Dolan
February 1, 2024
in Artificial Intelligence, Cognitive Science
An 18-month-old baby wearing a head-mounted camera. (Photo by Wai Keen Vong)



In a groundbreaking study published in the journal Science, researchers have developed a machine learning model that mimics the way children learn language, offering new insights into early language acquisition. Using video and audio recordings from a young child’s perspective, the model successfully learned to associate words with visual objects, a feat that sheds light on the mysterious process of how children begin to understand and use language.

Understanding how children learn language has long been a fascinating subject for scientists and educators alike. At the heart of this is the phenomenon of connecting words to their meanings – a process seemingly simple yet incredibly complex. This study sought to demystify this process using the latest advancements in artificial intelligence.

The motivation behind this research lies in the need for a deeper understanding of early language acquisition. Traditionally, studies in this field have been conducted in controlled laboratory settings, which may not accurately reflect the natural environment in which children learn language.

Furthermore, there is a growing interest in developing artificial intelligence systems that can learn language in human-like ways. By uncovering the mechanisms behind how children link words to their visual counterparts, researchers hoped to not only enrich cognitive science but also guide the development of more advanced AI systems.

“I’ve been doing research on concept and language acquisition from the beginning of my research career, as I think there are a lot of interesting questions behind how humans and machines can learn and use concepts and language. Working with the dataset that was used in this paper (the SAYCam-S dataset) provided a unique opportunity to study these kinds of questions, and seeing if models could learn anything from naturalistic slices from a single child’s input,” explained study author Wai Keen Vong, a research scientist at the Center for Data Science at New York University.

The SAYCam-S dataset was gathered using a head-mounted camera worn by a single child, capturing video and audio recordings from the age of 6 to 25 months. The dataset included 600,000 video frames paired with 37,500 transcribed utterances, derived from 61 hours of video. This approach aimed to mirror the natural learning environment of a child, contrasting with the more controlled settings of traditional laboratory studies.

Vong and his colleagues created a machine learning model, the Child’s View for Contrastive Learning (CVCL) model, which was fed video frames representing what the child saw and transcribed utterances representing what the child heard.

The CVCL model was designed to learn multimodal representations – a combination of visual and linguistic elements – and associate them with each other. The training of CVCL was self-supervised, meaning it did not rely on external labeling of data. Instead, the model learned by associating temporally co-occurring video frames and utterances as matching pairs, and treating non-co-occurring pairs as mismatches.
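The pairing scheme described above can be sketched as a symmetric contrastive (InfoNCE-style) loss: co-occurring frame/utterance embeddings are pushed together, all other pairings in the batch are pushed apart. The code below is an illustrative NumPy sketch under that assumption, not the paper’s actual implementation; the function name and temperature value are stand-ins.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss.

    Row i of img_emb is treated as matching row i of txt_emb (a
    temporally co-occurring frame/utterance pair); every other row
    pairing in the batch counts as a mismatch.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def xent(l):
        # Cross-entropy where the diagonal entries are the positives
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With this loss, perfectly aligned frame/utterance embeddings score lower (better) than a misaligned pairing of the same embeddings, which is the sense in which co-occurrence alone provides the training signal.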

This contrastive learning approach aimed to mimic the way children learn language – by associating words they hear with objects and events they see in their environment. During training, the model randomly sampled video frames associated with each utterance and applied data augmentation to these images for robust learning.
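That sampling-and-augmentation step can be sketched minimally as follows, with a random crop and horizontal flip standing in for whatever augmentations the authors actually used (the function name, crop size, and choice of augmentations here are all assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_and_augment(frames, crop=48):
    """Pick one random frame for an utterance, then augment it.

    `frames` is a list of (H, W, 3) arrays that co-occurred with a
    single utterance; the random crop and horizontal flip below are
    minimal stand-ins for the augmentations used in training.
    """
    frame = frames[rng.integers(len(frames))]
    h, w, _ = frame.shape
    top = rng.integers(h - crop + 1)
    left = rng.integers(w - crop + 1)
    patch = frame[top:top + crop, left:left + crop]
    if rng.random() < 0.5:  # random horizontal flip
        patch = patch[:, ::-1]
    return patch
```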

Video: https://www.psypost.org/wp-content/uploads/2024/01/Vong-adi1374-video-6.mp4

The model’s performance was evaluated against a range of everyday words and their corresponding visual referents in categorization tasks. It was also tested on its ability to generalize to novel visual exemplars not seen during training and to align visual and linguistic conceptual systems broadly.
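A categorization test of this kind can be sketched as nearest-neighbor matching between a frame embedding and the candidate word embeddings: the frame gets the label whose word embedding is most similar. This is a hypothetical illustration of the evaluation idea, not the study’s actual code, and all names below are assumptions.

```python
import numpy as np

def classify_frame(frame_emb, word_embs, labels):
    """Assign a frame the label of the nearest word embedding.

    `word_embs` rows align with `labels`; similarity is cosine,
    mirroring how a contrastively trained model can be probed for
    word-object categorization.
    """
    f = frame_emb / np.linalg.norm(frame_emb)
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(w @ f))]
```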

“By using AI models to study the real language-learning problem faced by children, we can address classic debates about what ingredients children need to learn words—whether they need language-specific biases, innate knowledge, or just associative learning to get going,” explained co-author Brenden Lake, an assistant professor at New York University.

The model achieved a classification accuracy of 61.6% on a dataset of frames annotated with 22 visual concepts, showing it could match words with visual objects effectively. In comparison tests, CVCL performed close to CLIP, an image-text contrastive neural network trained on vastly more data. The model also demonstrated modest knowledge of additional visual concepts when tested on novel stimuli, with an accuracy of 34.7%, which is significant because it suggests an ability to generalize beyond its training.

The findings from this study have broad implications for both cognitive science and the development of AI systems. The success of CVCL in mimicking a child’s language learning process challenges traditional theories that suggest more complex cognitive capacities are necessary for language acquisition. It demonstrates that simple associative learning mechanisms, coupled with multimodal representation learning, can be a solid foundation for understanding and replicating the early stages of word learning.

“Today’s state-of-the-art AI systems are trained using astronomical amounts of data (often billions/trillions of words), and yet humans manage to learn and use language with far less data (hundreds of millions of words), so the connection between these advances in machine learning to human language acquisition is not clear,” Vong explained to PsyPost. “To bridge that gap, in our work, we trained a multimodal neural network on 61 hours of visual and linguistic input from one child, and examined how much the model could learn, particularly in connecting words to their visual counterparts (e.g. connecting the word ‘ball’ to images of balls).”

“Surprisingly, the model acquired most (but not all) of the concepts present in its everyday experience, and could generalize this to visual instances of those words it hadn’t encountered either. These results suggest that the kinds of generic, associative learning mechanisms found in neural networks are sufficient for breaking into early word learning, without the need to posit additional constraints or inductive biases like other researchers have previously argued were necessary for language acquisition.”

However, the study is not without limitations. The data used was from a single child’s perspective, which may not represent the diversity of experiences across different children. Furthermore, the model’s ability to generalize to a broader range of linguistic and visual contexts remains to be tested. The CVCL model also does not account for the active, embodied nature of a child’s learning process and learns from static frames rather than temporally extended episodes.

“One caveat is that the language input to the model is text, not the underlying speech signal that children receive,” Vong said. “When learning from raw speech, children also need to learn how to segment the speech signal into individual words, which is not needed in our model. While this was a small limitation with the current study, we are confident that many of the aspects of our model could be left intact while incorporating the raw speech in future work.”

“We would also like to train our model on data from the additional children from the SAYCam dataset (there is video collected from two additional babies that we did not use in the current study), to see if our results are consistent and generalizable.”

Looking ahead, the research opens several avenues for future exploration. Incorporating more cognitively plausible assumptions into the model, such as the role of active learning, could bring the learning process in models closer to that in children. Additionally, extending the model to handle more complex aspects of language acquisition and testing it with data from multiple children are crucial steps forward.

The study, “Grounded language acquisition through the eyes and ears of a single child,” was authored by Wai Keen Vong, Wentao Wang, A. Emin Orhan, and Brenden M. Lake.
