Unlocking the Mysteries of Japanese Word Length

Abstract
Piantadosi, Tily, and Gibson analyzed a large-scale web-scraped corpus (the Google 1T dataset) and reported that word length is independently predicted by average information content (surprisal) computed with 2- to 4-gram models (hereafter, longer-span surprisal) across 11 Indo-European languages: Czech, Dutch, English, French, German, Italian, Polish, Spanish, Portuguese, Romanian, and Swedish. However, a recent article by Meylan and Griffiths stressed the importance of preprocessing in studies of large-scale corpora and reanalyzed the same data. After their preprocessing, the results of Piantadosi et al. were not replicated for Czech, Romanian, and Swedish. Additionally, a German-specific study by Koplenig, Kupietz, and Wolfer showed that, in a large-scale but less noisy database preprocessed as Meylan and Griffiths suggested, a strict analysis did not replicate the result of Piantadosi et al. for that language. Together, these three studies contribute evidence to this debate from 11 Indo-European languages and one Afro-Asiatic language, Hebrew. Evidence from other language families, however, is still lacking. This study provides evidence from Japanese, based on strict preprocessing of Google's web-scraped database. The results show that Japanese word length can be predicted independently by 2- to 4-gram surprisal.
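To make the notion of n-gram surprisal concrete, the following is a minimal sketch, not the procedure used in any of the studies above. It assumes a bigram model (the shortest of the 2- to 4-gram spans), unsmoothed maximum-likelihood probabilities, and a pre-tokenized corpus; the function name `mean_bigram_surprisal` and the toy tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mean_bigram_surprisal(tokens):
    """Average -log2 P(word | previous word) over all occurrences of each word.

    Assumption: unsmoothed maximum-likelihood bigram estimates.
    """
    context_counts = Counter(tokens[:-1])                  # count of each word as a context
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))  # count of each (previous, word) pair

    totals = defaultdict(float)  # summed surprisal per word
    occurrences = Counter()      # number of scored occurrences per word
    for (prev, word), n in bigram_counts.items():
        p = n / context_counts[prev]
        totals[word] += n * -math.log2(p)
        occurrences[word] += n
    return {w: totals[w] / occurrences[w] for w in totals}


# Toy, already-segmented input (hypothetical; real Japanese text needs a tokenizer first).
toy_tokens = "a b a b a c".split()
for word, bits in sorted(mean_bigram_surprisal(toy_tokens).items()):
    print(word, len(word), round(bits, 3))
```

In an analysis of the kind described here, each word's average surprisal would then be compared with its length, for instance through a rank correlation. That step, along with smoothing and the 3- and 4-gram contexts, is left out of this sketch.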
