Unlocking the Mysteries of Japanese Word Length

Published on June 12, 2023

Imagine you’re studying animals and you notice a pattern: bigger animals tend to have longer names. Curious, you dig deeper to see whether the pattern holds across different groups of animals. Language researchers have been asking a similar question about words. An earlier study used a large-scale web-scraping database to examine the relationship between word length and average information content (surprisal) in 11 Indo-European languages, and reported that surprisal predicts word length. However, later studies suggested that preprocessing choices can influence the results, and the finding did not hold up in every language once the data were cleaned more strictly. To shed light on this debate, the researchers behind this study ran a strict analysis of Japanese words using Google’s web-scraping database. What did they find? Even in Japanese, word length can be predicted by average information content. This adds evidence to the ongoing discussion and extends it beyond Indo-European languages. If you’re fascinated by language and want to explore the research in detail, follow the link below.

Abstract
Piantadosi, Tily, and Gibson analyzed a large-scale web-scraping corpus (the Google 1T dataset) and reported that word length is independently predicted by average information content (surprisal) calculated with a 2- to 4-gram model (hereafter, longer-span surprisal) across 11 Indo-European languages, namely, Czech, Dutch, English, French, German, Italian, Polish, Spanish, Portuguese, Romanian, and Swedish. However, a recent article by Meylan and Griffiths stressed the importance of preprocessing for studies with large-scale corpora and reanalyzed the same databases. After their preprocessing, the results of Piantadosi et al. were not replicated in Czech, Romanian, and Swedish. Additionally, a German-specific study by Koplenig, Kupietz, and Wolfer, which applied the preprocessing suggested by Meylan and Griffiths to a large-scale but less noisy database, showed that a strict analysis did not replicate the result of Piantadosi et al. for that language. Together, these three studies contribute evidence to this debate from 11 Indo-European languages and one Afro-Asiatic language, Hebrew. However, we do not have evidence from other linguistic groups. This study provides evidence about Japanese based on a strict preprocessing of Google’s web-scraping database. The results show that Japanese word length can be predicted independently by 2- to 4-gram surprisal.
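
To make "average information content" more concrete, here is a minimal sketch of how a word's average surprisal could be computed under a 2-gram model: for each occurrence of a word, take -log2 of its probability given the preceding word, then average over all occurrences. This is an illustration only, not the authors' pipeline. The function name `average_surprisal` and the toy English corpus are assumptions made for this example; the study itself works with Google's web-scraped n-gram data for Japanese, uses contexts up to 4-grams, and applies strict preprocessing that is not reproduced here. The sketch also uses raw relative frequencies with no smoothing.

```python
import math
from collections import Counter, defaultdict


def average_surprisal(sentences):
    """Return {word: mean of -log2 P(word | previous word)} over its occurrences.

    Hypothetical helper for illustration: probabilities are plain relative
    frequencies from the given sentences (no smoothing, no preprocessing).
    """
    context_counts = Counter()  # how often each word appears as a context
    bigram_counts = Counter()   # how often each (previous, word) pair appears
    for tokens in sentences:
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    totals = defaultdict(float)  # summed surprisal (in bits) per word
    occurrences = Counter()      # number of scored occurrences per word
    for (prev, word), n in bigram_counts.items():
        p = n / context_counts[prev]
        totals[word] += n * -math.log2(p)
        occurrences[word] += n
    return {w: totals[w] / occurrences[w] for w in totals}


# Toy corpus (an assumption for illustration, not data from the study).
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "elephant", "sat"],
]
scores = average_surprisal(toy_corpus)
for word in sorted(scores, key=scores.get):
    print(f"{word:10s} length={len(word)} avg_surprisal={scores[word]:.2f} bits")
```

In the studies discussed above, values like these are computed over very large corpora and then related to word length. A toy corpus this small cannot show the effect; it only shows what the quantity is. Words that never appear after another word (here, "the") receive no score in this sketch.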

Read Full Article (External Site)
