During speech production, encoding of language embeddings (blue) in the IFG peaked before encoding of speech embeddings (red) in the sensorimotor area, followed by the peak of speech encoding in the STG. In contrast, during speech comprehension, peak encoding shifted to after word onset, with speech encoding (red) in the STG peaking significantly before language encoding (blue) in the IFG.
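To make the temporal comparison concrete, the sketch below shows how a peak encoding lag could be extracted from per-lag correlation curves. This is a minimal illustration, not the paper's analysis code; the lag grid and the region/embedding variable names (ifg_language, smc_speech, stg_speech) are hypothetical placeholders for precomputed, cross-validated encoding correlations.

```python
import numpy as np

def peak_lag(corrs_by_lag: np.ndarray, lags_ms: np.ndarray) -> float:
    """Return the lag (ms, relative to word onset) at which the
    cross-validated encoding correlation is maximal."""
    return float(lags_ms[np.argmax(corrs_by_lag)])

# Hypothetical per-lag correlation curves (shape: (len(lags_ms),)),
# e.g. averaged across electrodes within a region:
lags_ms = np.arange(-2000, 2001, 25)
# ifg_language, smc_speech, stg_speech = ...

# During production, the ordering described above would appear as:
#   peak_lag(ifg_language, lags_ms) < peak_lag(smc_speech, lags_ms)
#                                   < peak_lag(stg_speech, lags_ms)
# During comprehension, peak_lag(stg_speech, ...) would fall after 0 ms
# and precede peak_lag(ifg_language, ...).
```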
Taken together, our findings suggest that speech-to-text model embeddings provide a cohesive framework for understanding the neural basis of language processing during natural conversations. Surprisingly, although Whisper was developed solely for speech recognition, with no regard for how the brain processes language, we found that its internal representations align with neural activity during natural conversations. This alignment was not guaranteed: a negative result would have shown little to no correspondence between the embeddings and the neural signals, indicating that the model’s representations did not capture the brain’s language-processing mechanisms.
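As a rough sketch of what “alignment” means operationally, the snippet below fits a cross-validated linear (ridge) encoding model from model embeddings to a single electrode’s activity and scores it by the correlation between predicted and actual responses. The function name, the ridge penalty grid, and the 10-fold split are assumptions for illustration, not the authors’ exact pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_correlation(X: np.ndarray, y: np.ndarray, n_splits: int = 10) -> float:
    """Cross-validated correlation between predicted and actual neural
    activity; one simple operational definition of 'alignment'.

    X: (n_words, n_features) embeddings, one row per word.
    y: (n_words,) response of one electrode around each word
       (e.g., high-frequency broadband power at a fixed lag).
    """
    preds = np.empty_like(y, dtype=float)
    for train, test in KFold(n_splits=n_splits).split(X):
        model = RidgeCV(alphas=np.logspace(-2, 6, 9))
        model.fit(X[train], y[train])
        preds[test] = model.predict(X[test])
    return float(np.corrcoef(preds, y)[0, 1])

# A negative result would put this value near zero, i.e. within the
# distribution obtained after shuffling the word order (see below).
```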
A particularly intriguing concept revealed by the alignment between LLMs and the human brain is the notion of a “soft hierarchy” in neural processing. Although language regions such as the IFG prioritize word-level semantic and syntactic information, as indicated by their stronger alignment with language embeddings (blue), they also capture lower-level auditory features, as evidenced by their weaker yet significant alignment with speech embeddings (red). Conversely, although lower-order speech areas such as the STG prioritize acoustic and phonemic processing, as indicated by their stronger alignment with speech embeddings (red), they also capture word-level information, as evidenced by their weaker yet significant alignment with language embeddings (blue).
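The “soft hierarchy” claim can likewise be phrased as a concrete comparison: for each region, fit separate encoding models on speech and language embeddings and check that both beat a shuffled-word null while their ranking flips between regions. The sketch below is again hypothetical; it reuses encoding_correlation() from the previous sketch, and names like X_speech and y_ifg stand in for data the snippet does not construct.

```python
import numpy as np
# Assumes encoding_correlation() from the previous sketch is in scope.

def shuffled_null(X: np.ndarray, y: np.ndarray, n_perm: int = 200,
                  seed: int = 0) -> np.ndarray:
    """Permutation null for the encoding correlation: shuffle which word
    each neural response belongs to, then re-fit the encoding model."""
    rng = np.random.default_rng(seed)
    return np.array([encoding_correlation(X, rng.permutation(y))
                     for _ in range(n_perm)])

# Hypothetical usage, with X_speech / X_language the two embedding types
# and y_ifg / y_stg per-word responses in each region:
# for region, y in {"IFG": y_ifg, "STG": y_stg}.items():
#     r_speech = encoding_correlation(X_speech, y)
#     r_language = encoding_correlation(X_language, y)
#     # Soft hierarchy: both correlations exceed the shuffled null, but
#     # IFG ranks r_language > r_speech while STG ranks r_speech > r_language.
```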