The team tested familiar effects—word length, how common a word is, how surprising it is in context, and measures of syntactic complexity—using a statistical model that separates person-level signals from noise. The clearest finding is that simple, perceptual signals like word length are stable within individuals. More complex measures tied to meaning and sentence structure show weaker and sometimes inconsistent person-level stability, and linking eye-tracking to self-paced reading often reveals only modest agreement.

For anyone interested in learning, assessment, or inclusive education, this matters because interventions and diagnostics often rely on the idea that individual reading traits are measurable and transferable across settings. These results invite a careful rethink of how we measure reading skill and language processing if we want tests and tools that fairly reflect diverse learners’ abilities. Read the full article to explore how measurement choices shape what we can confidently say about individual differences in reading.

Abstract
Psycholinguistic theories traditionally assume similar cognitive mechanisms across different speakers. More recently, however, researchers have begun to recognize the need to consider individual differences when explaining human cognition, and an increasing number of studies have investigated how individual differences influence human sentence processing. Implicitly, these studies assume that individual-level effects can be replicated across experimental sessions and different assessment methods such as eye-tracking and self-paced reading. This assumption is challenged by the Reliability Paradox. Thus, a crucial first step for a principled investigation of individual differences in sentence processing is to establish their measurement reliability, that is, the correlation of individual-level effects across multiple measurement occasions and methods. In this work, we present the first naturalistic eye movement corpus of reading data with four experimental sessions from each participant (two eye-tracking sessions and two self-paced reading sessions). We deploy a two-task Bayesian hierarchical model to assess the measurement reliability of individual differences in a range of psycholinguistic phenomena that are well established at the population level, namely, effects of word length, lexical frequency, surprisal, dependency length, and number of to-be-integrated dependents. While our results indicate high reliability across measurement occasions for the word length effect, reliability is only moderate for higher-level psycholinguistic predictors such as lexical frequency, dependency length, and the number of to-be-integrated dependents, and even low for surprisal. Moreover, even after accounting for spillover effects, we observe only low to moderate reliability at the individual level across methods (eye-tracking and self-paced reading) for most predictors, and poor reliability for predictors of syntactic integration. These findings underscore the importance of establishing measurement reliability before drawing inferences about individual differences in sentence processing.
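The abstract defines measurement reliability as the correlation of individual-level effects across sessions. The sketch below illustrates that idea in its simplest frequentist form, not the authors' two-task Bayesian hierarchical model: fit a per-participant regression slope (e.g., reading time on word length) in each of two simulated sessions, then correlate the slopes across participants. All data and parameter values here are synthetic and purely illustrative.

```python
# Illustrative sketch (NOT the authors' model): test-retest reliability of an
# individual-level effect, estimated as the correlation across participants of
# per-participant regression slopes from two separate sessions.
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_words = 50, 200

# Simulate a stable per-participant word-length effect: each person's true
# slope (ms per letter) differs, but is shared between the two sessions.
true_slope = rng.normal(20.0, 5.0, n_participants)

def simulate_session(noise_sd):
    """Synthetic reading times = intercept + slope * word_length + noise."""
    word_len = rng.integers(2, 12, (n_participants, n_words))
    rt = (250.0 + true_slope[:, None] * word_len
          + rng.normal(0.0, noise_sd, (n_participants, n_words)))
    return word_len, rt

def per_participant_slopes(word_len, rt):
    """Ordinary least-squares slope fitted separately for each participant."""
    slopes = []
    for x, y in zip(word_len, rt):
        slope, _intercept = np.polyfit(x, y, 1)
        slopes.append(slope)
    return np.array(slopes)

s1 = per_participant_slopes(*simulate_session(noise_sd=50.0))
s2 = per_participant_slopes(*simulate_session(noise_sd=50.0))

# Test-retest reliability: Pearson correlation of the slopes across sessions.
reliability = np.corrcoef(s1, s2)[0, 1]
print(f"estimated reliability: {reliability:.2f}")
```

In this toy setup reliability comes out high because the true slopes vary substantially across people relative to estimation noise; shrinking the between-person variance or adding noise drives the correlation down, which is the pattern the paper reports for higher-level predictors. A hierarchical model improves on this sketch by estimating the latent correlation directly rather than correlating noisy point estimates.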
