The Languages Left Behind by AI
- Puran Singh

Thanks to ferocious investment and a wellspring of talent in Silicon Valley, the United States is dominating the AI market. Although rivals such as China have sought to scale their own models, the ten most popular AI models are all American. However, as the US continues to forge ahead, others risk being silently left behind.
To generate responses, AI models employ a far more sophisticated version of what’s called a word n-gram approach, whereby the most probable next word in a sentence is chosen given the preceding context. To learn those probabilities, AI models scrape reams of language data from the Internet – hence the common description of these systems as Large Language Models (LLMs). Because most of the Internet is in English, models trained on the broadest corpora represent English text with fewer tokens – the units of text that models process and charge for – than text in other languages. By the same token, the languages most underrepresented on the Internet are left the furthest behind in training datasets.
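To make the next-word idea concrete, the sketch below is a toy bigram (two-word n-gram) predictor in Python. It illustrates the counting principle the article describes rather than how production LLMs actually work, and the tiny training corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count how often each word follows each other word."""
    words = corpus.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts: dict, prev: str):
    """Return the most probable next word given the previous one."""
    if prev not in counts:
        return None  # unseen context: no data to predict from
    return counts[prev].most_common(1)[0][0]

corpus = "the quick brown fox jumps over the lazy dog and the quick cat naps"
model = train_bigram(corpus)
print(predict_next(model, "the"))   # -> "quick" (follows "the" most often)
print(predict_next(model, "lazy"))  # -> "dog"
```

An LLM replaces these raw counts with billions of learned parameters and attends to far more than one preceding word, but the core loop – score candidate continuations, pick a likely one – is the same.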
Furthermore, models are refined by human feedback, and incongruities between responses in different languages may go unnoticed. Because the world’s languages were already becoming less diverse before AI, many speakers may feel forced to turn their backs on their ancestral languages and cultures and learn tongues with more training data. Tech executives frequently herald AI as the ultimate educational tool, painting vivid pictures of children without access to schooling using AI to discover the world. Given the emerging language gap, however, many linguistic minorities around the world may be left behind in the AI race.
The crux of the issue lies in the dominance of the English language on the Internet. According to W3Techs, a survey of web technology usage, approximately half of all websites are written in English. Although it may not feel like it in the West, English is spoken by only around 20% of the world’s population and is therefore grossly overrepresented online. The dominance of English isn’t just the result of differing access: because the Internet was developed in the United States, the systems underlying the digital world were built around English. In UTF-8 – the standard character encoding on the Internet – the basic Latin alphabet requires the fewest bytes per character to encode. Logographic languages such as Mandarin convey more information per character and remain fairly efficient to encode, but LLMs formulate their responses from preexisting content, and the smaller pool of text available in those languages means models need more processing to understand them. Therefore, early US dominance of the Internet is likely to pay dividends for English speakers as generative technology continues to be adopted around the world.
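To see the encoding asymmetry directly, the snippet below counts the UTF-8 bytes needed for a few sample words. It uses only Python’s standard library, and the specific words are arbitrary illustrations.

```python
# Bytes needed to encode sample text in UTF-8 across different scripts.
samples = {
    "English":  "dog",     # basic Latin: 1 byte per character
    "Russian":  "собака",  # Cyrillic: 2 bytes per character
    "Mandarin": "狗",      # CJK: 3 bytes per character
    "Cherokee": "ᏣᎳᎩ",     # Cherokee syllabary: 3 bytes per character
}

for language, text in samples.items():
    encoded = text.encode("utf-8")
    print(f"{language:9} {text!r}: {len(text)} characters, {len(encoded)} bytes")
```

Before a model processes a single token, text in Cyrillic, Chinese, or Cherokee already takes two to three times as many bytes per character as Latin-script text.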
To understand the scope of the language gap, the measurement system AIs use to compute costs must first be understood. To process inputs and outputs, LLMs use tokens: chunks of UTF-8 encoded text roughly equivalent to four English characters, or about three-quarters of a word. Tokens are used to interpret requests, respond, and contextualize conversations based on past queries. Because UTF-8 encodes Latin characters with the fewest bytes and tokenizers are trained mostly on English text, requests are the cheapest, token for token, in English. Although logographic languages like Mandarin are more information-dense per character, they lose out to English’s overrepresentation online. According to OpenAI’s Tokenizer Playground, the pangram “The quick brown fox jumps over the lazy dog” costs 70% more tokens per character to interpret in Mandarin than in English. While most models offer subscription-based pricing for everyday consumers, commercial users pay by the token. Anthropic, for example, charges $5 per million tokens for enterprise use. For businesses relying on AI, those higher costs leave companies outside the Anglosphere at a significant disadvantage. For languages outside the mainstream, the outlook is bleak. Cherokee – a language classified as endangered by UNESCO – costs over 13 times more tokens per character than English, and most LLMs hardly return a coherent response to queries in Cherokee. Barring a sudden explosion of Cherokee media on the Internet, AI is likely to leave those who speak only Cherokee behind. If AI cannot respond coherently to requests made in endangered languages, those languages become less enticing to learn.
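Readers who want to reproduce a comparison like this can count tokens with OpenAI’s open-source tiktoken library, as in the rough sketch below. The cl100k_base encoding is one tokenizer among several, the exact ratios vary by model, and the Mandarin line is an approximate rendering of the pangram used purely for illustration.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

samples = {
    "English":  "The quick brown fox jumps over the lazy dog",
    # Approximate Mandarin rendering of the pangram, for illustration only.
    "Mandarin": "敏捷的棕色狐狸跳过懒狗",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    print(f"{language:8} {len(text):2d} characters -> {len(tokens):2d} tokens "
          f"({len(tokens) / len(text):.2f} tokens per character)")
```

Because commercial APIs bill per token, whatever ratio the tokenizer produces translates directly into a price gap for the same request.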
The language gap is expanding. Because AI models learn from user responses and are fine-tuned by experts, the overrepresentation of English today will lead to an even greater chasm between English and the rest in the future. To fine-tune models and regulate the quality of content produced by AI, humans around the world are employed as filters, manually flagging problematic content. As the flagged content tends to be in English, filtering is also most effective in English. According to a paper published in 2024, GPT-4 responded to harmful requests 20% more frequently when asked in languages other than English. Although LLMs have improved significantly since then, the underlying principle still stands: AIs are better at filtering in English, and will remain so until those who filter content do so with a greater variety of languages in mind. Although it could be argued that minimal filtering encourages usage, companies don’t filter their models solely because of government intervention: poor filters undermine trust in the product, which hinders adoption. A smaller audience outside the Anglosphere signals to companies building AI models that they should shift their focus away from multilingual capability, which perpetuates the cycle. Linguistic gaps are insidious because they accentuate themselves, festering until they are addressed head on.
Further exacerbating the language divide are the monstrous costs companies are swallowing in their quest to dominate the generative AI market. To train on petabytes of data and serve up responses, LLMs consume gigawatts of energy and run mind-numbing calculations on cutting-edge Graphics Processing Units (GPUs). While costs for both energy and GPUs had already been soaring, the Strait of Hormuz blockage has caused the price of energy to explode. Because energy is a factor of production for data centers, rising prices have hamstrung investment in AI. What’s more, the private equity funding that AI startups relied on to scale is drying up amid higher interest rates and mixed investor sentiment. As tech titans become cost-conscious, the focus on improving capabilities for customers who speak underrepresented languages may collapse.
Thanks to globalization, ideas can now be disseminated at a breakneck pace and new technologies emulated rapidly. At the same time, those without access to these technologies are marginalized, left behind in the wake of an unapologetic system. As the rate of technological progress accelerates, the people – indeed, the communities – who are unwilling or unable to learn another language and relinquish their culture are being left behind. As the newest step forward, Artificial Intelligence seems poised to continue that pattern, albeit in a new way. Because smaller communities are underrepresented in training data, the term “Artificial Intelligence” looks to be a misnomer. Owing to inequities in the systems undergirding the Internet, AI reflects nothing more than preexisting biases: the compression of history by the victors through vectors.




