The team compared where people looked while planning and while speaking, then linked those eye-movement patterns to the meanings of the sentences produced. Strong parallels emerged across languages: similar sentence meanings were paired with similar eye-movement sequences, even when word order and syntax differed. Syntactic structure influenced gaze in predictable ways, but the effect was brief, appearing mainly within a single language during the earliest moments of planning.

This work matters because it suggests a common, language-agnostic representational layer that guides how we translate what we see into what we say. If vision supplies a stable semantic scaffold, language-specific rules are then applied on top of it to shape the linear order of speech. Read the full article to explore how these findings reshape ideas about human communication, learning, and the inclusive design of tools that rely on shared visual understanding.
Abstract
A central question in cognition is how representations are integrated across different modalities, such as language and vision. One prominent hypothesis posits the existence of an abstract, prelinguistic “language of vision” as a representational system that organizes meaning compositionally, enabling cross-modal integration. This hypothesis predicts that the language of vision operates universally, independent of linguistic surface features such as word order. We conducted eye-tracking experiments in which participants described visual scenes in English, Portuguese, and Japanese. By analyzing spoken descriptions alongside eye-movement sequences divided into planning and articulation phases, we demonstrate that semantic similarity between sentences strongly predicts the similarity of associated scan patterns in all three languages, even across scenes and between sentences in different languages. In contrast, the effect of syntactic constraints was secondary and transient: it was restricted to within-language and within-scene comparisons, and temporally confined to the early planning phase of the utterance. Our findings support an interactive account of cross-modal coordination in which a universal language of vision provides stable semantic scaffolding, while syntax serves as a local constraint, primarily active during message linearization.
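
The abstract describes the similarity analysis only at a conceptual level. As a rough illustration, here is a minimal Python sketch of a representational-similarity-style comparison, assuming sentence meanings are encoded as embedding vectors, scan patterns as sequences of fixated scene regions, cosine similarity for meaning, and normalized edit distance for gaze sequences. All data, names, and metric choices below are illustrative assumptions, not the study's actual pipeline.

import itertools
import numpy as np
from scipy.stats import spearmanr

def edit_similarity(a, b):
    # Normalized Levenshtein similarity between two label sequences (1 = identical).
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(m + 1), np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return 1.0 - d[m, n] / max(m, n)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical data: embedding vectors standing in for sentence meanings,
# and scan patterns as sequences of fixated scene regions.
meanings = {
    "dog-chases-ball (EN)":  np.array([0.9, 0.1, 0.2]),
    "dog-chases-ball (JP)":  np.array([0.8, 0.2, 0.3]),
    "woman-reads-book (EN)": np.array([0.1, 0.9, 0.7]),
}
scanpaths = {
    "dog-chases-ball (EN)":  ["dog", "ball", "dog", "ball"],
    "dog-chases-ball (JP)":  ["dog", "ball", "ball", "dog"],
    "woman-reads-book (EN)": ["woman", "book", "woman"],
}

pairs = list(itertools.combinations(meanings, 2))
semantic_sim = [cosine(meanings[a], meanings[b]) for a, b in pairs]
scanpath_sim = [edit_similarity(scanpaths[a], scanpaths[b]) for a, b in pairs]

# The hypothesis predicts these two similarity rankings correlate,
# even for sentence pairs drawn from different languages.
rho, p = spearmanr(semantic_sim, scanpath_sim)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

In the abstract's terms, a high rank correlation that holds for cross-language sentence pairs would be the signature of a shared semantic scaffold, while a correlation confined to within-language, within-scene pairs would point to a language-specific, syntactic contribution.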