The Languages Left Behind by AI
- Puran Singh
- May 7
- 5 min read

Thanks to ferocious investment and a wellspring of talent in Silicon Valley, the United States is dominating the AI market. Although others like China have sought to scale, the top ten most popular AI models are American. However, as the US continues to forge ahead, others risk being silently left behind.
To generate responses, AI models employ a far more sophisticated version of what’s called a word n-gram approach, whereby the most probable next word in a sentence is chosen given the previous context. To learn which word is most probable, models are trained on reams of language data scraped from the Internet – hence the name Large Language Model (LLM). As most of the Internet is in English, models trained on the broadest corpus use fewer tokens – the chunks of text a model reads, writes, and bills by – in English than in other languages. By the same token, the languages most underrepresented on the Internet are left the furthest behind in training datasets.
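For readers who want to see the basic idea in miniature, here is a rough sketch of a two-word (bigram) predictor in Python. It is a toy under stated assumptions, not how any production model works: the function names `train_bigrams` and `predict_next` and the tiny corpus are invented for illustration, and real LLMs use neural networks over tokens rather than raw word counts.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus; a real model trains on billions of sentences.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)

print(predict_next(model, "the"))      # "cat" follows "the" twice, "mat" only once
print(predict_next(model, "osiyo"))    # a word absent from the corpus gives None
```

The second call hints at the article's point: a word the model has never seen in training produces no useful prediction at all.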
Furthermore, models are refined by human feedback, and any incongruities between responses in different languages may go unnoticed. As the world’s languages were already becoming less diverse before AI, many people may feel forced to turn their backs on their ancestral languages and cultures and learn tongues with more training data. Frequently, tech executives herald AI as the ultimate educational tool, painting vivid pictures of children without access to education using AI to discover the world. However, given the emerging language gap, many linguistic minorities around the world may be left behind in the AI race.
The crux of the issue lies in the dominance of the English language on the Internet. According to W3Techs, which tracks the content languages of websites, approximately half of all websites are in English. Although it may not feel like it in the West, English is spoken by only around 20% of the world population and is therefore grossly overrepresented on the Internet. The dominance of English isn’t just the result of differing access: as the Internet was developed in the United States, the systems underlying the digital world were built in English. In UTF-8 – the standard encoding system on the Internet – unaccented Latin letters need only one byte each, while characters in most other scripts need two to four. Logographic scripts such as Chinese convey more information per character, so they remain fairly efficient to encode. LLMs, however, rely on preexisting content to formulate their responses, and because the pool of text available in those languages is smaller, models need more tokens and more processing power to handle them. Therefore, early US dominance of the Internet is likely to pay dividends for English speakers as generative technology continues to be adopted around the world.
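The byte counts are easy to check. The short Python sketch below measures how many UTF-8 bytes a single character takes in a handful of scripts. The helper name `utf8_bytes_per_char` and the choice of sample characters are illustrative assumptions, not anything taken from W3Techs or a model vendor.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes used per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


# One sample character per script, written as escapes so the code points are explicit.
samples = {
    "Latin (English) 'a'": "\u0061",
    "Latin with accent (e-acute)": "\u00e9",
    "Cyrillic (ya)": "\u044f",
    "Devanagari (ka)": "\u0915",
    "Chinese (zhong)": "\u4e2d",
    "Cherokee (a)": "\u13a0",
}

for label, char in samples.items():
    print(f"{label}: {utf8_bytes_per_char(char):.0f} byte(s)")
```

Plain English letters take one byte, accented Latin and Cyrillic letters take two, and the Devanagari, Chinese, and Cherokee characters take three each.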
To understand the scope of the language gap, the measurement system AIs use to compute costs must first be understood. To process inputs and outputs, LLMs use tokens: short chunks of text, each built from a run of UTF-8 encoded bytes and roughly equivalent to four characters of English. Tokens are used to interpret requests, respond, and contextualize conversations based on past queries. Because UTF-8 encodes Latin characters with the fewest bytes and tokenizer vocabularies are built mostly from English text, requests are the cheapest, by the token, in English. Although certain logographic scripts, like the Chinese characters used to write Mandarin, are more information dense per character, they lose out to English’s overrepresentation online. According to OpenAI’s Tokenizer Playground, the pangram “The quick brown fox jumps over the lazy dog” costs 70% more tokens per character to interpret in Mandarin than in English. While most providers offer subscription-based pricing for everyday consumers, commercial users pay by the token. Anthropic, for example, charges $5/million tokens for enterprise use. For businesses relying on AI, these higher costs leave companies outside the Anglosphere at a significant disadvantage.
For languages outside the mainstream, the outlook is bleak. Cherokee – a language classified as endangered by UNESCO – costs over 13 times more tokens per character than English, and most LLMs hardly even return a response to queries in Cherokee. Barring a sudden explosion of Cherokee media on the Internet, AI is likely to leave those who speak only Cherokee behind. If AI is unable to respond coherently to requests made in endangered languages, those languages become less enticing to learn.
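To put rough numbers on that gap, the sketch below turns the figures quoted above into a back-of-the-envelope bill. Several things here are assumptions rather than measurements: the English baseline of 0.25 tokens per character (from the four-characters-per-token rule of thumb), the use of 13 as a floor for "over 13 times", and the function name `estimate_cost`. It also compares equal character counts, even though the same message takes a different number of characters in each language.

```python
PRICE_PER_MILLION_TOKENS = 5.00   # USD, the enterprise figure quoted in the article
ENGLISH_TOKENS_PER_CHAR = 0.25    # assumption: roughly 4 English characters per token

# Tokens-per-character multipliers relative to English, as quoted in the article.
# "Over 13 times" is treated as exactly 13, so the Cherokee estimate is a lower bound.
MULTIPLIERS = {"English": 1.0, "Mandarin": 1.7, "Cherokee": 13.0}


def estimate_cost(num_chars, language):
    """Estimated USD cost to process `num_chars` characters in `language`."""
    tokens = num_chars * ENGLISH_TOKENS_PER_CHAR * MULTIPLIERS[language]
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000


for language in MULTIPLIERS:
    cost = estimate_cost(1_000_000, language)
    print(f"{language}: ${cost:.2f} per million characters")
```

Under these assumptions a million characters costs about $1.25 in English, a little over $2 in Mandarin, and more than $16 in Cherokee.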
The language gap is expanding. As AI models learn from user responses and are fine-tuned by experts, overrepresentation of English now will lead to a greater chasm between English and the rest in the future. To fine-tune models and regulate the quality of content produced by AI, humans around the world are employed as filters, manually flagging problematic content for AI. As flagged content tends to be in English, filtering is also most effective in English. According to a paper published in 2024, GPT-4 responded to harmful requests 20% more frequently when asked in languages other than English. Although LLMs have improved significantly since 2024, the underlying principle still stands: AIs are better at filtering in English, and will remain so until those who filter content do so with a greater variety of languages in mind. Although it could be argued that minimal filtering encourages usage, companies don’t just filter their models because of government intervention: poor filters undermine trust in the product, which hinders user adoption. A smaller audience of users outside of the Anglosphere tells companies building AI models to shift their focus away from multilingual models, which perpetuates the cycle. Linguistic gaps are insidious because they reinforce themselves, festering until they are addressed head on.
Further exacerbating the language divide are the monstrous costs companies are swallowing in their quest to dominate the generative AI market. To understand queries and serve up responses, LLMs pore over petabytes of data, drawing gigawatts of power and running mind-numbing calculations on cutting-edge Graphics Processing Units (GPUs) in the process. While costs for both energy and GPUs had already been soaring, the Strait of Hormuz blockage has caused the price of energy to explode. As energy is a factor of production for data centers, investment in AI has been hamstrung by the price of energy. What’s more, the private equity funding AI startups relied on to scale is drying up due to increases in interest rates and mixed investor sentiment. As tech titans become cost-conscious, focus on improving capabilities for customers speaking underrepresented languages may collapse.
Thanks to globalization, ideas can now be disseminated at a breakneck pace, and new technologies emulated rapidly. At the same time, those without access to these technologies are marginalized, left behind in the wake of an unapologetic system. As the rate of technological progress continues to accelerate, those people – indeed, those communities – who are unwilling or unable to learn another language and relinquish their culture are being left behind. As the newest step forward, Artificial Intelligence seems poised to do the same, albeit in a new way. As smaller communities are underrepresented in training data, the term “Artificial Intelligence” looks to be a misnomer. Due to inequities in the systems undergirding the Internet, AI reflects nothing more than preexisting biases, the compression of history by the victors through vectors.




