I was wondering which language expresses more information per square pixel, and I think the answer is somewhat counterintuitive: it’s Chinese, despite the script’s apparent visual complexity. In other words, even though each Chinese character is more visually complicated than any English letter, it also carries more information, because you need far fewer characters to express the same content.
Here is the chart that shows this. In short, it shows what happens if you downsample an image with the same content step by step and calculate how much information is still left in the image. The peak for each language is that language’s maximum information density per square pixel. Chinese wins; English is the worst.
Here is the process to calculate this:
- Pick a text. In this case, I picked a 7th-grade reading comprehension text: a short story of around 1,000 words about a girl baking bread with her father.
- Pick a target language.
- Translate the text into the target language.
- Print the text in the language onto an image. Force it onto a 2,000 pixel-wide image, and wrap the text.
- Run image recognition on the image to read the text.
- Translate the recognized text back into English.
- Feed the text into the OpenAI Completions API (running on GPT-3). Run through 10 questions that need one-word answers, and have GPT-3 answer them. This gives us a score from 0 to 1: the fraction of answers it got right.
- Feed the text again into the OpenAI Completions API. Now ask GPT-3 straight up to compare the recognized, re-translated text to the original text. This gives us another score from 0 to 1 that expresses how similar GPT-3 thinks the two texts are (through its mysterious inner workings).
- Calculate the text embeddings of the original text and of the recognized, re-translated text using the OpenAI Embeddings API. The cosine similarity between the two embeddings gives us another score from 0 to 1 that expresses how close the two texts are in embedding space.
- Now downsample the original image by 10% – i.e., just shrink the image. Don’t change anything else. That means the original content is now expressed on fewer square pixels.
- Repeat steps 5 to 10, until the text quality on the image gets so low that the algorithms start completely failing.
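The loop the steps above describe can be sketched in Python. This is a minimal skeleton, not the original analysis code: the callables (`render`, `ocr`, `translate_back`, and the two scorers) are placeholders for the actual Pillow rendering, image recognition, translation, and GPT-3 calls.

```python
def downsample_widths(start=2000, factor=0.9, floor=100):
    """Widths of successive downsampled images: shrink by 10% per step."""
    widths = []
    w = start
    while w >= floor:
        widths.append(w)
        w = int(w * factor)
    return widths


def run_pipeline(text, render, ocr, translate_back, score_qa, score_similarity,
                 start_width=2000, factor=0.9, floor=100):
    """Skeleton of the measurement loop.

    The injected callables stand in for the real rendering, OCR,
    translation, and GPT-3 scoring steps described in the post.
    """
    results = []
    for w in downsample_widths(start_width, factor, floor):
        img = render(text, w)               # draw the translated text at width w
        recognized = ocr(img)               # run image recognition on it
        back = translate_back(recognized)   # translate back into English
        results.append({
            "width": w,
            "qa": score_qa(back),                   # fraction of 10 questions right
            "similarity": score_similarity(back),   # GPT-3's 0..1 similarity judgment
        })
    return results
```

In practice the loop also needs a stopping condition for when OCR starts failing completely; here it simply runs until the image falls below a floor width.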
Some details on the steps above:
- We need a font that can express any language you want, without giving any particular language an “advantage” (by adding or subtracting visual “flair” relative to another language). The Google Noto Sans font family can do that.
- We need to pick a downsampling filter for when we shrink the image. I wanted a simple one, so I used the Box filter in the Python Pillow library. Other filters might retain visual language information longer (i.e., down to smaller images), but it would be surprising if that changed the relative ranking between languages.
- Notice that there are no languages in this comparison set that write from right to left. That’s because I couldn’t figure out how to install the libraqm library alongside Pillow and eventually just gave up. You need that (obscure) library to change text direction.
- My usual observation when using a large language model for anything is that the LLM often does better the less you constrain it. I played around with asking the “right questions” of the text through GPT-3, and eventually I just told GPT-3 to compare the two texts straight up (“how similar in meaning are these?”). The two approaches produced remarkably similar results.
- The embeddings comparison, on the other hand, isn’t that useful. Say we compare a massively downsampled text image with the original text, and the image quality is so low that GPT-3 answers only 2 of the 10 questions correctly and rates the semantic similarity at 4 out of 10. The embedding cosine similarity will still come out around 0.77. Just too high.
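For reference, the embedding score is plain cosine similarity between two embedding vectors. The tiny vectors below are made up for illustration (real OpenAI embeddings have on the order of a thousand dimensions); the point is that two vectors pointing in roughly the same direction score high even when they differ noticeably, which helps explain why degraded text still lands around 0.77.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up "embedding" vectors that mostly point the same way:
original = [0.8, 0.5, 0.3, 0.1]
degraded = [0.7, 0.6, 0.2, 0.3]

score = cosine_similarity(original, degraded)  # high (~0.96) despite the differences
```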
Here is another way to visualize all this: the “information value” (the product of the two semantic scores from the GPT-3 tests – question answering and similarity – ignoring the embeddings) vs. the square pixels of each downsampled image, by language. Same data as in the chart above. This view makes it clearer that there are some structural weaknesses in the machine learning APIs in this analysis pipeline. For example, GPT-3’s comparison of the re-translated Japanese texts (which all happens in English, remember!) is never that high, so Japanese’s information value starts at a lower level. Still, what happens at the low square-pixel numbers is what matters, and there the picture is consistent – though it’s possible that Japanese is more visually dense than it gets credit for here. Whatever happens, English remains the worst!
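A minimal sketch of how the “information value” and density numbers can be computed, under my reading of the description above: multiply the two GPT-3 scores, then divide by the image’s square pixels to get a density, whose peak is what the first chart plots per language. The sample measurements are invented, not data from the actual experiment.

```python
def information_value(qa_score, similarity_score):
    """Combine the two GPT-3 scores (each 0..1) into one value."""
    return qa_score * similarity_score


def information_density(qa_score, similarity_score, width, height):
    """Information value per square pixel of the rendered image."""
    return information_value(qa_score, similarity_score) / (width * height)


# Made-up measurements: (width, height, qa, similarity) per downsampling step.
steps = [
    (2000, 1500, 1.0, 0.95),
    (1458, 1094, 1.0, 0.90),
    (1062, 797, 0.9, 0.85),
    (773, 580, 0.8, 0.70),
    (562, 422, 0.4, 0.40),
]

densities = [information_density(q, s, w, h) for (w, h, q, s) in steps]
peak = max(densities)  # maximum information density across downsampling steps
```

The density rises as the image shrinks (same content on fewer pixels) until OCR and comprehension start failing, at which point it collapses – hence the peak.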