Imagine you’re studying animals and you notice a pattern: bigger animals tend to have longer names. Curious, you dig deeper to see whether the pattern holds in different groups of animals. Language researchers have been asking a similar question about words. In an earlier study, Piantadosi, Tily, and Gibson used a large web-scraped corpus to examine the relationship between word length and average information content (surprisal) in 11 Indo-European languages, and reported that a word’s length can be predicted from its average information content. However, later studies suggested that the way the data are preprocessed can influence the results, and the finding did not hold up in every language once stricter preprocessing was applied. To shed light on this debate, the authors of the present study conducted a strict analysis of Japanese words using Google’s web-scraped database. And what did they find? Even in Japanese, word length can be predicted by average information content. This adds evidence from outside the Indo-European languages to the ongoing discussion. If you’re fascinated by language and want the details, the abstract follows.
Abstract
Piantadosi, Tily, and Gibson analyzed a large-scale web-scraping corpus (the Google 1T dataset) and reported that word length is independently predicted by average information content (surprisal) calculated with a 2- to 4-gram model (hereafter, longer-span surprisal) across 11 Indo-European languages, namely, Czech, Dutch, English, French, German, Italian, Polish, Spanish, Portuguese, Romanian, and Swedish. However, a recent article by Meylan and Griffiths suggested the importance of preprocessing for studies with large-scale corpora and reanalyzed the same databases. After their preprocessing, the results in Piantadosi et al. were not replicated in Czech, Romanian, and Swedish. Additionally, a German-specific study by Koplenig, Kupietz, and Wolfer showed that a strict analysis, applying the preprocessing suggested by Meylan and Griffiths to a large-scale but less noisy database, did not replicate the result in Piantadosi et al. for that language. Together, these three studies provide the evidence relevant to this debate, drawn from 11 Indo-European languages and one Afro-Asiatic language, Hebrew. However, evidence from other linguistic groups is still lacking. This study provides evidence about Japanese based on a strict preprocessing of Google’s web-scraping database. The results show that Japanese word length can be predicted independently by 2- to 4-gram surprisal.
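To make "average information content" concrete, here is a minimal Python sketch of how a word's average surprisal could be computed from a bigram (2-gram) model. It is an illustration under stated assumptions, not the procedure used in any of the studies above: the toy English corpus, the function name `average_bigram_surprisal`, and the unsmoothed probability estimate are all hypothetical choices. The studies discussed here use 2- to 4-gram contexts on very large corpora, and Japanese text would first have to be segmented into words, which this sketch assumes has already been done.

```python
import math
from collections import Counter, defaultdict


def average_bigram_surprisal(tokens):
    """Return {word: mean of -log2 P(word | previous word)} over all its occurrences.

    Assumption: probabilities are plain relative frequencies (no smoothing),
    which is only reasonable for a toy illustration.
    """
    # How often each word serves as the preceding context.
    context_counts = Counter(tokens[:-1])
    # How often each (previous word, word) pair occurs.
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    totals = defaultdict(float)   # summed surprisal per word, in bits
    occurrences = Counter()       # number of scored occurrences per word
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_counts[(prev, word)] / context_counts[prev]
        totals[word] += -math.log2(p)
        occurrences[word] += 1
    return {w: totals[w] / occurrences[w] for w in totals}


# Hypothetical toy corpus, already split into words.
tokens = "the cat sat on the mat the cat ate the fish".split()
avg = average_bigram_surprisal(tokens)

# List words from shortest to longest next to their average surprisal.
for word in sorted(avg, key=len):
    print(f"{word!r}: length={len(word)}, mean surprisal={avg[word]:.2f} bits")
```

The first token has no preceding word, so that occurrence is not scored. A real analysis would then test whether these averages predict word length across the whole vocabulary; that statistical step, and any control for other predictors, is left out of this sketch.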
Dr. David Lowemann, M.Sc, Ph.D., is a co-founder of the Institute for the Future of Human Potential, where he leads the charge in pioneering Self-Enhancement Science for the Success of Society. With a keen interest in exploring the untapped potential of the human mind, Dr. Lowemann has dedicated his career to pushing the boundaries of human capabilities and understanding.
Armed with a Master of Science degree and a Ph.D. in his field, Dr. Lowemann has consistently been at the forefront of research and innovation, delving into ways to optimize human performance, cognition, and overall well-being. His work at the Institute revolves around a profound commitment to harnessing cutting-edge science and technology to help individuals lead more fulfilling and intelligent lives.
Dr. Lowemann’s influence extends to the educational platform BetterSmarter.me, where he shares his insights, findings, and personal development strategies with a broader audience. His ongoing mission is shaping the way we perceive and leverage the vast capacities of the human mind, offering invaluable contributions to society’s overall success and collective well-being.