[Academic report] The 61st Sui Yuan Xin Guang Young Scholars’ Forum: Multi-modal Language Interface and Hierarchical Speech Prediction Mechanisms
On the afternoon of December 6, 2025, the 61st Sui Yuan Xin Guang Young Scholars’ Forum was held in Room 312 of the School of Psychology at Nanjing Normal University. This session featured a lecture by Researcher Du Yi of the Institute of Psychology, Chinese Academy of Sciences, titled “Multi-modal Language Interface and Hierarchical Prediction Mechanisms in Speech.” The lecture was chaired by Chen Qingrong, Director of the Academic Affairs Office of Nanjing Normal University and Professor at the School of Psychology. Many faculty members and students from the School of Psychology attended and took part in the discussion.
Researcher Du Yi is a professor and doctoral supervisor at the Institute of Psychology, Chinese Academy of Sciences (CAS). She is Director of the Cognitive and Developmental Psychology Laboratory at the Institute of Psychology, CAS, holds a CAS Distinguished Core Faculty Position, and is Deputy Secretary-General of the Chinese Psychological Society and Vice Chair of its Music Psychology Committee. She is also an editorial board member of Acta Psychologica Sinica. Her research combines cognitive neuroscience techniques with computational modeling to investigate the neurocomputational mechanisms underlying human speech and musical communication. She has published in leading international journals including Nature Communications, Science Advances, PNAS, PLOS Biology, and eLife, and she leads projects funded by the National Science Fund for Distinguished Young Scholars and the Science and Technology Innovation 2030 Major Project.
In her lecture, Researcher Du Yi first discussed the multi-modal nature of language: the same semantic concept can be expressed through multiple forms such as speech, text, and images. The core question is how sensorimotor features are integrated into the language network. Functional MRI experiments revealed that area 55B, in the posterior part of the middle frontal gyrus, is reliably activated across listening, speaking, reading, and writing tasks, with stronger activation during language production. Representational analyses further indicated that the anterior portion of 55B more prominently encodes semantics, while the posterior portion more strongly encodes phonetics. This internal functional differentiation supports the view of 55B as a “sensorimotor-semantic interface.”
Moreover, 55B occupies a hub position in the functional connectivity network, bridging the language and sensorimotor networks. Structural and functional gradient analyses also identify this region as a “steep jump zone” between sensorimotor cortex and higher-order association cortex, providing a biological basis for cross-modal information integration. This finding offers neural evidence for how the human brain aligns multimodal information and suggests brain-inspired interface concepts for the design of current multimodal large models.
Regarding language prediction, an idiom completion experiment showed that when the final character of an idiom is omitted (e.g., “恩重如□”, from 恩重如山, “a kindness as weighty as a mountain”), the brain enhances phonetic representations of the missing character in speech-motor regions (e.g., premotor cortex) and strengthens its semantic representations in default mode network regions (e.g., angular gyrus, posterior cingulate cortex). Notably, neural representations during prediction showed a “geometric expansion” in low-dimensional space, suggesting that top-down semantic retrieval plays a central role in prediction.
Regarding error processing, the study further revealed that upon hearing an incorrect idiom (e.g., “琳琅满天”, in which the final character of 琳琅满目 is replaced), the brain simultaneously represented the predicted character, the perceived character, and the discrepancy between them. Representations of semantic prediction error lay between those of the predicted and the perceived information and were hierarchically distributed across the temporal and frontal lobes. This suggests that the brain may use error signals to dynamically update its internal predictive models.
In the discussion session at the end of the lecture, faculty and students engaged in an in-depth exchange on topics including experimental design, the functions of specific brain regions, and real-time prediction in natural language understanding. Researcher Du Yi noted that the team has collected additional magnetoencephalography (MEG) data and plans to incorporate neuroimaging techniques with higher temporal resolution to uncover the dynamic network mechanisms underlying language prediction and error processing.
This lecture systematically presented the latest advances in language neuroscience concerning cross-modal integration and predictive processing. By combining experimental findings with theoretical frameworks, it offers valuable insights for research on language comprehension, artificial intelligence, and brain-inspired computational models.