A new study published in Science Advances presents a method that converts human brain activity into coherent, descriptive text—even when the brain is not actively processing language. Instead of decoding words or sentences directly, the method interprets the nonverbal representations that occur before thoughts are put into words.
The study suggests that even when individuals are only watching or recalling silent video clips, their brain activity contains enough structured information to generate accurate descriptions of the scenes. Using functional MRI and advanced language models, Tomoyasu Horikawa, a distinguished researcher at NTT’s Communication Science Laboratories in Japan, was able to produce natural-language captions that closely matched both the objective content of the videos and the participants’ subjective recollections.
The motivation behind this work stemmed from a long-standing challenge in neuroscience: how to decode and interpret the rich, internal content of the human mind. While previous studies have shown some success in mapping brain activity to language, these efforts often rely on participants actively thinking in words, such as by speaking, reading, or listening. Such approaches limit the scope of decoding because not all mental experiences are verbal, and not all individuals have equal access to language, particularly those with conditions like aphasia.
Human thoughts often involve visual scenes, events, and abstract concepts that are not immediately translated into words. These mental representations can be detailed and structured, incorporating relationships between objects, actions, and environments. However, most decoding methods fall short of capturing this complexity, especially when relying on models that either imitate existing language structures or depend on hand-crafted databases of descriptions.
The researcher aimed to bridge this gap by developing a method that could interpret nonverbal mental representations—those formed during perception or memory—into coherent and meaningful text. The goal was not to read minds in the traditional sense, but to provide an interpretive interface that reflects what the brain is representing during an experience.
“I’ve long been fascinated by how the brain generates and represents content associated with our subjective conscious experiences, such as mental imagery and dreaming,” Horikawa told PsyPost. “I believe that brain decoding technology can help us investigate these questions while providing clear and intuitive interpretations of the information encoded in the brain.”
“Developing more sophisticated decoding methods could therefore advance our understanding of the neural bases of conscious experience — and, in the long run, help people whose difficulties might be relieved or overcome through direct information readout from the brain. The idea of mind captioning grew out of this effort — to better understand how such internal representations can be translated into language and shared meaningfully.”
Horikawa designed a decoding method called “mind captioning.” The approach involves two main steps: first, translating brain activity into semantic features using a deep language model; and second, generating natural language descriptions that align with those semantic features.
The study involved six adult participants, all native Japanese speakers with varying levels of English proficiency. They were shown thousands of short video clips depicting a wide range of visual content, including objects, actions, and social interactions. These videos were silent and shown without accompanying language. Functional MRI scans captured the participants’ brain activity during both the viewing of the videos and subsequent mental recall of the same clips.
The researcher trained a set of linear decoding models to map patterns of brain activity to semantic features extracted from captions written about each video. These semantic features were derived using a language model known as DeBERTa, which is designed to represent the meaning of text in a high-dimensional space.
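To make this first step concrete, the sketch below shows, in broad strokes, how a regularized linear decoder can be trained to map brain activity patterns onto text-derived semantic features. The data shapes, the ridge regression model, and the random placeholder arrays are illustrative assumptions; the study's actual decoders were fit to fMRI data and DeBERTa-derived caption features.

```python
# Minimal sketch of the decoding step: learn a linear map from brain activity
# to semantic features. All numbers and arrays below are placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# One row per video presentation (real inputs would be preprocessed fMRI
# voxel patterns and embeddings of captions written about each video).
n_trials, n_voxels, n_features = 500, 1000, 768
X_brain = rng.standard_normal((n_trials, n_voxels))       # fMRI patterns
Y_semantic = rng.standard_normal((n_trials, n_features))  # caption features

X_train, X_test, Y_train, Y_test = train_test_split(
    X_brain, Y_semantic, test_size=0.2, random_state=0
)

# Regularized linear map from voxels to every semantic dimension at once.
decoder = Ridge(alpha=100.0)
decoder.fit(X_train, Y_train)

# Decoded semantic features for held-out brain activity, ready to feed
# into the text-generation step.
Y_pred = decoder.predict(X_test)
print(Y_pred.shape)  # (100, 768)
```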
After learning this mapping, the decoder was applied to new brain activity from both perception and recall conditions. The resulting semantic features were then used to generate text using another language model (RoBERTa) optimized for filling in missing words in a sentence. Through an iterative process of guessing, testing, and replacing words, the system gradually produced full sentences that reflected the brain’s decoded representations.
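The generation step can be illustrated with a simplified loop in the same spirit: repeatedly mask one word of a candidate sentence, ask a masked language model for replacements, and keep whichever word brings the sentence's embedding closest to the decoded target. The specific models (roberta-base, all-MiniLM-L6-v2), the starting sentence, and the greedy update rule below are stand-ins, not the study's exact procedure.

```python
# Simplified sketch of iterative "guess, test, replace" text generation guided
# by a target semantic vector. Illustrative only; not the study's algorithm.
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

fill_mask = pipeline("fill-mask", model="roberta-base")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In the real pipeline this vector comes from the fMRI decoder; here we embed
# a "ground truth" caption just to have a target to optimize toward.
target = embedder.encode("a person throws a ball to a dog in a park")

sentence = "someone does something with an object outside".split()

for _ in range(3):  # a few refinement passes over the sentence
    for i in range(len(sentence)):
        masked = sentence.copy()
        masked[i] = fill_mask.tokenizer.mask_token
        candidates = fill_mask(" ".join(masked), top_k=5)

        best_word = sentence[i]
        best_score = cosine(embedder.encode(" ".join(sentence)), target)
        for cand in candidates:
            trial = sentence.copy()
            trial[i] = cand["token_str"].strip()
            score = cosine(embedder.encode(" ".join(trial)), target)
            if score > best_score:
                best_word, best_score = trial[i], score
        sentence[i] = best_word

print(" ".join(sentence))
```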
These generated sentences were evaluated in several ways. First, they were compared to human-written captions for accuracy and similarity using standard natural language evaluation metrics like BLEU, ROUGE, and BERTScore. The results showed that the machine-generated descriptions were highly discriminative: they reliably distinguished between different videos, even among 100 options.
The decoding method, when applied to participants’ brain activity, could identify the correct video with nearly 50% accuracy—a substantial improvement over the 1% expected by chance.
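The identification analysis itself is simple to outline: each generated description is scored against the reference captions of 100 candidate videos, and a trial counts as correct when the true video's caption ranks highest. The sketch below uses random placeholder similarity scores; the study ranked candidates with text-similarity metrics such as BERTScore.

```python
# Minimal sketch of 100-way video identification from generated descriptions.
# Similarity values are random placeholders with a small "signal" on the
# diagonal, standing in for real description-to-caption similarity scores.
import numpy as np

rng = np.random.default_rng(0)
n_videos = 100  # candidate set size; chance level is 1/100

# similarity[t, c]: similarity between the description generated for video t
# and the reference caption of candidate video c.
similarity = rng.random((n_videos, n_videos))
similarity[np.diag_indices(n_videos)] += 0.3  # true matches score a bit higher

predicted = similarity.argmax(axis=1)
accuracy = np.mean(predicted == np.arange(n_videos))

print(f"100-way identification accuracy: {accuracy:.1%} (chance = {1 / n_videos:.0%})")
```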
Notably, the method also generated meaningful descriptions from brain activity during the recall phase, though performance was not as high as for direct viewing. This indicates that the method could verbalize remembered experiences without requiring external stimuli. In some cases, the decoder performed well even on single instances of mental imagery.
“When I first tested the text generation algorithm after coming up with the approach, I was genuinely surprised to see how the original text corresponding to the extracted semantic features was progressively built up — step by step — into a coherent structure,” Horikawa said. “It felt as if I were hearing the faint voice of the brain seeping through the noise of the data, which made me confident that the approach could work.”
One of the key findings is that these descriptions included more than just lists of objects. They captured interactions and relationships, such as who did what to whom, or how different elements were arranged in space. When the word order of the generated sentences was shuffled, their similarity to the reference captions dropped sharply, showing that the original structure conveyed relational meaning, not just vocabulary.
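A rough version of that word-order control can be expressed in a few lines: shuffle the words of a generated description and measure how its similarity to the reference caption changes. The sentences and the embedding model below are illustrative assumptions, and a general-purpose sentence embedding may show a smaller drop than the study's own metrics did.

```python
# Minimal sketch of the word-order control: compare intact vs. shuffled
# descriptions against a reference caption. Sentences and model are examples.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

reference = "a man kicks a ball to a child on a beach"
generated = "a person kicks a ball toward a kid near the sea"

words = generated.split()
random.seed(0)
random.shuffle(words)
shuffled = " ".join(words)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref_vec, gen_vec, shuf_vec = embedder.encode([reference, generated, shuffled])

print("intact  similarity:", round(cosine(gen_vec, ref_vec), 3))
print("shuffled similarity:", round(cosine(shuf_vec, ref_vec), 3))
```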
“Another impressive finding came from the neuroscientific analysis shown in Figure 4E, where we examined how perception-trained decoders generalized to mental imagery using different types of feature representations (visual, visuo-semantic, and semantic),” Horikawa told PsyPost. “Although this trend was conceptually expected, we observed a remarkably clear gradient of generalizability across these levels, with semantic representations showing the strongest ability to bridge neural patterns between perception and recall.”
The study also found that descriptions could be generated without relying on activity in the brain’s traditional language areas. Even when these regions were excluded from analysis, the system still produced intelligible and structured descriptions. This suggests that meaningful semantic information is distributed across brain regions that process visual and contextual information, not just language.
“The study shows that it’s possible to generate coherent, meaningful text from brain activity — not by decoding language itself, but by interpreting the nonverbal representations that come before language,” Horikawa explained. “This may suggest that our thoughts are organized in a way that already carries structural information even before we put them into words, offering a new window into how the brain transforms experience into expression.”
“In the future, if we can learn to express ourselves more freely or interact with machines directly through our own brain activity, as in brain–machine interfaces, we may be able to unlock more of the brain’s potential.”
Although the study presents a promising approach, it comes with several limitations. The sample size was small, involving only six participants, all of whom underwent extensive training and scanning. However, each subject contributed many hours of data, which helped improve the decoding model’s reliability.
“Although our study included a relatively small number of participants, each contributed a substantial amount of data (about 17 hours of brain scanning), which allowed us to establish strong and reliable effects within individuals,” Horikawa said. “For example, the model achieved around 50% accuracy in a 100-alternative video identification task for each participant (see supplementary) — highly reliable performance given the difficulty of the problem (chance = 1%).”
“Importantly, these robust within-subject effects were consistently observed across all six participants, suggesting that the findings are practically significant despite the limited number of participants.”
Another limitation lies in the nature of the stimuli. The videos used in the study reflected common, real-world scenarios. It’s unclear whether the method would work as well for abstract concepts, atypical scenes, or highly personal mental content like dreams.
“As our method generates text from brain activity, it may be misinterpreted as a form of language decoding or reconstruction,” Horikawa noted. “However, this is not actually decoding of language information in the brain, but rather a linguistic interpretation of non-linguistic mental representations. Our method leverages the universal and versatile nature of natural language to provide intelligible interpretations of the information represented in the brain.”
There are also concerns about privacy. The idea of interpreting mental content raises ethical questions about autonomy and consent. While the current method requires large amounts of data from cooperative individuals, future advances may reduce this barrier.
“Some people may worry that this technology poses risks to mental privacy,” Horikawa told PsyPost. “In reality, the current approach cannot easily read a person’s private thoughts — it requires substantial data collection from highly cooperative participants, and its accuracy remains limited, with outputs affected by bias and noise. At present, the risks appear to be not high, though the ethical and social implications should continue to be discussed carefully as the technology develops.”
“What is important is not only to develop these technologies responsibly, but also to reflect on how we handle the information decoded from brain activity. We should avoid immediately treating the outputs as someone’s ‘true thoughts,’ and instead ensure that individuals retain autonomy in deciding whether and how to regard or present such outputs as their own intentions.”
Looking ahead, the approach could be extended to other types of mental content, such as auditory experiences, emotions, or internal narratives. It may also help in designing communication systems for individuals who cannot use speech or writing. By treating language as a bridge rather than the source, the method opens new possibilities for exploring how the brain generates and organizes meaning before it is expressed.
“My long-term goal is to understand the neural mechanisms underlying our subjective conscious experiences, and to help humans more fully realize the potential of the brain through scientific and technological advances,” Horikawa explained. “We plan to continue improving brain decoding approaches to access the information encoded in the brain more accurately and in greater detail, while ensuring that these technologies remain both scientifically valuable for understanding the brain and beneficial for people.”
The study, “Mind captioning: Evolving descriptive text of mental content from human brain activity,” was authored by Tomoyasu Horikawa.