I was looking at a plane’s inflight magazine a few years ago that had the same text in various languages. It occurred to me that logographic languages (like Mandarin) seemed quite wasteful, given the large number of characters they have to encode. On the other hand, each character probably encodes a lot more meaning than a letter in the Latin alphabet. So how does that work out? In modern times, the place where this matters most is Twitter: you have a real constraint on the length of your content (280 characters – or so I thought), so it matters how much information you can condense into a single tweet. Do certain languages have naturally more information density built in? And related: does the evolution of languages optimize at all for information density? (Which would be quite remarkable.)
Here is my answer: the chart below shows the “effective” number of characters in a tweet in a particular language, indexing English to 280 characters, and scaling that number up or down based on a particular language’s information density.
Here is how I calculated that:
- Pick a text. I used part of Charles Dickens’s “A Christmas Carol”.
- Run it through the Google Translate API and translate it into all the languages listed above. (Pretty much all the languages available in Google Translate)
- For each translation, build an alphabet that lets you encode the translated text without any loss: iterate through the entire text, create a new entry in the alphabet whenever you encounter a new character, and increment the character’s count if you’ve seen it before.
- Calculate the entropy of the language as the sum of -p(x) * log2(p(x)) over all letters x in the just-constructed alphabet, where p(x) is the frequency with which the letter occurs. This is basically the information density implied by Huffman encoding, a simple form of compression that assigns the shortest codeword (measured in bits) to the most frequently occurring letter in an alphabet, and so on.
- Then scale the 280 character number for English tweets up or down linearly by the ratio of that language’s entropy to English’s entropy.
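The counting, entropy, and scaling steps can be sketched in a few lines of Python. (The Google Translate step is omitted, and the strings below are short stand-ins for the translated Dickens text, not the data behind the chart.)

```python
from collections import Counter
from math import log2

def entropy_per_char(text: str) -> float:
    """Shannon entropy in bits per character: sum of -p(x) * log2(p(x))."""
    counts = Counter(text)          # the "alphabet" with per-character counts
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def effective_tweet_length(text: str, english_text: str, limit: int = 280) -> int:
    """Scale the 280-character English budget by the entropy ratio."""
    return round(limit * entropy_per_char(text) / entropy_per_char(english_text))

# Short stand-ins for the translated texts (for illustration only):
en = "it was the best of times, it was the worst of times"
zh = "这是最好的时代，这是最坏的时代"
effective_tweet_length(zh, en)  # effective budget if the whole corpus looked like zh
```

One caveat: Huffman codewords have integer bit lengths, so actual Huffman output is slightly larger than this entropy bound – but for comparing languages against each other, the bound is the cleaner measure.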
Three observations from the chart above:
- Amazing how close languages are in their entropy. Either all languages go back to some kind of ur-language – or this is really the process of “meme” evolution at work: humans in very different parts of the world have been speaking and writing so much over the past thousands of years that basically every language is pretty comparable to all other languages in terms of how much information it packs.
- When Twitter expanded its character limit to 280 a few years ago, it started counting certain glyphs (Korean, Chinese and Japanese characters) as two characters each – so the character limit in those languages is effectively still 140. Here is the blog post; it just looked at empirical Twitter data, not language entropy. And I think they got that wrong: the right weight for each glyph should be around 1.5. Right now, if you write a Chinese tweet, you arguably have less space than when writing an English tweet, because the language’s remarkably higher information density doesn’t entirely make up for the 2:1 glyph weight “penalty”.
- Back to my original question: it is in fact the case that Mandarin is packed much more densely than English. The same text in Mandarin would require you to transmit 40% fewer bits than transmitting it in English.
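To make the weighting argument concrete: with a glyph weight of w, a full tweet holds 280/w characters, so the information it carries relative to a full English tweet is (H_lang / H_en) / w – which means the “fair” weight is exactly the entropy ratio. A small sketch, using made-up per-character entropy values (4 and 6 bits, chosen only so the ratio comes out to 1.5 – not the measured numbers):

```python
def information_ratio(h_lang: float, h_en: float, glyph_weight: float,
                      limit: int = 280) -> float:
    """Bits in a full tweet of this language relative to a full English tweet.

    A glyph weight of w means each character consumes w units of the
    280-unit budget, so a full tweet holds limit / w characters.
    """
    return (limit / glyph_weight) * h_lang / (limit * h_en)

# Hypothetical entropies, for illustration only:
h_en, h_zh = 4.0, 6.0   # bits per character

information_ratio(h_zh, h_en, glyph_weight=2.0)  # 0.75: a full Chinese tweet holds less
information_ratio(h_zh, h_en, glyph_weight=1.5)  # 1.0: the "fair" weight
```

Under these (assumed) numbers, the current 2:1 weight leaves a Chinese tweet carrying only three quarters of the information of an English one.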
Another interesting question is the “visual” entropy of languages: how many bits are needed to encode the visual design of an alphabet’s characters, without loss of legibility and thus information content? That’s for another time.