6 Comments
hohoda's avatar

the "language tax" framing is sharp. it's invisible at the UI layer, which is exactly why it's easy to declare AI "democratized" while the infrastructure underneath is still heavily English-first.

though i'd push back a bit on the "build your own LLM" path as the solution — that's only accessible to a handful of countries with the capital, compute, and talent (China, France, UAE, maybe Japan). for everyone else, that bar is unreachable.

the more tractable fix might actually be at the tokenizer level — train tokenizers on balanced multilingual corpora rather than English-dominant ones. DeepSeek, for instance, has a tokenizer optimized for Chinese, which changes the economics without requiring every country to build from scratch. curious if you see that as viable, or do you think the data imbalance runs too deep to fix from the tokenizer up?
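to make the token-economics point concrete, here's a toy sketch (a made-up greedy longest-match tokenizer with a hypothetical English-heavy vocab, not any real BPE) of how the same meaning costs far more tokens in a morphologically rich language:

```python
# Toy illustration, NOT a real tokenizer: greedy longest-match over a
# hypothetical English-heavy subword vocabulary. English chunks get long
# entries; Turkish falls back to single characters.
ENGLISH_HEAVY_VOCAB = {
    "from", "our", "house", "s",            # common English subwords
    "e", "v", "l", "r", "i", "m", "z", "d", "n",  # char-level fallback
}

def greedy_tokenize(text, vocab):
    """Split text into the longest vocab entries, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown char: emit as-is
            i += 1
    return tokens

# "evlerimizden" ≈ "from our houses" (ev + ler + imiz + den)
print(greedy_tokenize("fromourhouses", ENGLISH_HEAVY_VOCAB))  # 4 tokens
print(greedy_tokenize("evlerimizden", ENGLISH_HEAVY_VOCAB))   # 12 tokens
```

twelve tokens vs four for roughly the same meaning — and since API pricing and context windows are per token, that ratio is the tax.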

Asli Öztürk's avatar

thank you for sharing your insights, you got me thinking about that :)

I agree that building your own LLM may not be easy for many countries, for the reasons you mentioned. During my master's studies, my graduation project was "Sentiment Analysis on Turkish Texts", and I had only a small dataset to train on my laptop's GPU. Even that small dataset took so long to train that I can't imagine the cost implications of building an LLM.

Although I am not an ML engineer, I can only speak from my experience without pretending I know everything :d I do believe the data imbalance is very real and deep, and I also think that even if DeepSeek has a Chinese-optimized tokenizer, it is probably not useful for Turkish, because the core problem with Turkish is agglutination. How can a tokenizer (from DeepSeek or OpenAI) parse a Turkish sentence without losing its semantic precision? Or am I missing something?

hohoda's avatar

You're raising a really valid point. To clarify something I should have been more precise about earlier — in traditional NLP pipelines (think spaCy, NLTK, or even standalone BERT fine-tuning), tokenizers are relatively modular and can be swapped or customized more freely.

But in modern LLMs, the tokenizer is inseparable from the model itself. The vocabulary is baked into the embedding layer at training time — every token ID maps directly to a row in the embedding matrix. So replacing the tokenizer would invalidate the entire weight structure, meaning you'd need to retrain from scratch anyway.
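A minimal sketch of that coupling (made-up vocabularies and toy deterministic "embeddings", not any real model):

```python
# Row i of the embedding matrix was trained to mean old_vocab[i],
# so the vocab order is effectively part of the weights.
old_vocab = ["the", "ev", "##ler", "cat"]   # order fixed at training time
embeddings = [[float(i)] * 4 for i in range(len(old_vocab))]  # row i <-> old_vocab[i]
old_id = {tok: i for i, tok in enumerate(old_vocab)}

# A hypothetical Turkish-aware tokenizer with a different vocabulary:
new_vocab = ["evler", "the", "cat", "##den"]
new_id = {tok: i for i, tok in enumerate(new_vocab)}

# Looking up "the" with the new IDs silently returns the row that was
# trained to mean "ev" -- and "evler"/"##den" have no row at all --
# so the old weights are useless without retraining.
wrong_row = embeddings[new_id["the"]]
assert wrong_row == embeddings[old_id["ev"]]
assert "evler" not in old_id
```

There's no error, just wrong vectors everywhere — which is why "just swap the tokenizer" doesn't work the way it did in classic NLP pipelines.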

This makes the agglutination problem even harder to solve for Turkish in the LLM era — the only real path is building (or continued pre-training of) a model with a Turkish-aware tokenizer from the ground up, which circles back to the cost and resource barriers you mentioned.

hohoda's avatar

side note — my native language isn't English either, so this isn't just theoretical for me. the tax is real.

Tsetsy's avatar

This is such an important point. We talk about "democratizing AI" but rarely about the hidden costs baked into the system itself.

Asli Öztürk's avatar

Exactly! I think we have a long way to go when it comes to democratizing AI.