The authors analyze naturalistic video-call conversations from the CANDOR corpus and relate GPT-2 surprisal, a measure of how unpredictable a word is in context, to three prosodic cues: duration, intensity, and pitch. Higher surprisal goes with longer word duration, and maximum pitch and pitch range also rise with surprisal even after controlling for duration; the evidence for intensity is mixed. Brief listener responses (backchannels such as "yeah" or "mhm") tend to be accompanied and followed by a spike in the surprisal of the speaker's words, and the size of the context window affects model fit differently for maximum pitch than for the other variables. The data and analysis code are shared publicly, which makes the analysis easier to check and reproduce.
These patterns are consistent with speech shaped by the need to be intelligible: speakers appear to emphasize words that listeners would find harder to predict. The study hints at how adaptive human communication is and how subtle signals support mutual understanding. Follow the link at the end of the abstract to see how these acoustic choices map onto real conversations and what that might mean for designing technologies that respect human communicative needs.
Abstract
Conversation is a dynamic, multimodal activity involving the exchange of complex streams of information like words, prosody, gesture, eye contact, and backchannels. Understanding how these different channels interact in naturalistic scenarios is essential for understanding the mechanisms governing human communication. Past studies suggested that the duration of words is tied to their predictability in context, but it remains unclear whether this relationship is speaker-oriented (e.g., retrieval or production-based) or due to listener-oriented, intelligibility-based pressures (i.e., emphasizing unpredictable words to ease comprehension). This study examines the relationship between predictability and additional acoustic variables, to test how much intelligibility-oriented principles impact conversation. We use the GPT-2 large language model to assess the relationship between surprisal, a measure of unpredictability, and several variables known to play an important role in conversation: the prosodic features of duration, intensity, and pitch. We perform this analysis on the CANDOR corpus of naturalistic English video-call conversations between strangers. In keeping with previous results using n-gram predictability, we find that higher GPT-2 surprisal predicts significantly longer word durations. Moreover, surprisal also predicts maximum pitch and pitch range even when controlling for duration, with mixed evidence for an effect of surprisal on intensity. Additionally, we investigate listener backchannels (short interjections like “yeah” or “mhm”) and find that they tend to be accompanied and followed by a spike in the surprisal of speakers’ words. Finally, we demonstrate a divergence between the effect of context window size on the model fit of surprisal to maximum pitch and to other variables.
The results provide additional support for intelligibility-based accounts, which hold that language production is sensitive to a pressure for successful communication, not just speaker-oriented pressures. Our data and analysis code are shared: https://osf.io/sqpn6/?view_only=e4d9e36c68b54863bc781e359463e1fe.
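For readers unfamiliar with surprisal, here is a minimal sketch of how the quantity is usually defined: the negative log probability of a word given its preceding context. It is an illustration only, not the authors' pipeline. The function name `word_surprisals`, the toy probabilities, and the use of bits (log base 2) are assumptions made for this example. Summing over subword tokens follows from the chain rule, since a word's probability is the product of the conditional probabilities of its tokens. In practice the probabilities would come from a language model such as GPT-2.

```python
import math


def word_surprisals(token_probs, word_ids):
    """Aggregate per-token surprisal (in bits) into per-word surprisal.

    token_probs: P(token | preceding context) for each token. These are
                 hypothetical stand-ins for language-model outputs.
    word_ids:    index of the word each token belongs to, since one word
                 may be split into several subword tokens.
    """
    totals = {}
    for p, w in zip(token_probs, word_ids):
        # Surprisal of a token is -log2 of its conditional probability;
        # summing over a word's tokens gives the word's surprisal.
        totals[w] = totals.get(w, 0.0) - math.log2(p)
    return [totals[w] for w in sorted(totals)]


# Toy example: three words, the second split into two subword tokens.
probs = [0.5, 0.25, 0.5, 0.125]
ids = [0, 1, 1, 2]
print(word_surprisals(probs, ids))  # [1.0, 3.0, 3.0]
```

Lower probability means higher surprisal: the last word, with probability 0.125, carries 3 bits, while the first, with probability 0.5, carries 1 bit.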