What Language Does Your Robot Speak?
The Internet today is more diverse than ever, but AI could change that
It’s commonly called the “World Wide Web” — but when it comes to the languages used (and even available) online, how “world-wide” is it really?
This is an important question. For one, online language representation creates greater inclusion in the digital space. And this is important for realizing the overall promise of the internet: to learn, share and connect with people and information from all around the world. A big part of why it’s exciting to see visually-impaired Indians, Cuban artists, Nigerian mothers, and Syrian schoolteachers come online isn’t just that they can enjoy the internet’s fruits — it’s that we can learn from them, too.
Even more crucially, a dearth of online content in languages with few speakers may speed up their extinction, as we come to rely increasingly on digital spaces for… everything. As computational linguistics professor David Yarowsky says, “If a Quechuan kid in Peru can’t surf the web in Quechua, they’ll use Spanish. This is a rapidly accelerating train.”
And the train speeds up in a world ever-more reliant on artificial intelligence, where more content in a given language creates better AI utilizing it — and less content in a language means it will see even less proportional representation.
The musician Holly Herndon brilliantly frames artificial intelligence as a kind of digital representation of our aggregate or collective intelligence. But what happens to the collective, aggregate intelligence of so-called “low-resource” or “under-resourced” languages online if there’s no AI to represent them?
I’ve quoted the words of linguist Ken Hale in the past, but they bear repeating:
“Languages embody the intellectual wealth of the people that speak them. Losing any one of them is like dropping a bomb on the Louvre.”
There are an estimated 7,000 languages in the world. Thousands of Louvres carrying in them the collective humor, experience, and wisdom of millennia. Just 100 of these are spoken by around half the world — but even so, a phenomenal four billion people use one of the thousands of languages outside that set. Languages that, without Herculean effort, may never see the digital light of day.
Many languages don’t yet have access to online representation at all. Unicode, the standard for online text and emoji, supports over 160 scripts, and more continue to be added. But as Sanjeev Khudanpur of Johns Hopkins’ Center for Language and Speech Processing says, “Collectively, the population speaking languages lacking suitable digital alphabets could number a few billion people.”
Felix Laumann writes that only about twenty of the 7,000 languages spoken around the world “have text corpora of hundreds of millions of words” — that is, enough online presence to build and maintain robust AI models. And Khudanpur of Johns Hopkins puts it more starkly: “Many languages beyond the top 20 or 30 most-spoken… don’t work with Siri or Alexa. No one is developing chatbots in these languages.”
Back in 2013, scholar András Kornai argued the “consensus figure vastly underestimates the danger of digital language death,” and predicted “a massive die-off caused by the digital divide.”
And last year Bhanu Neupane, who works on language inequity at UNESCO, warned that “The world is converging… after 15 years, there could be just five or 10 languages that are prominently spoken and used in business and online. So we’re very concerned about this.”
So what’s to be done?
Increased digital language support helps, and happily, Google has made great strides in building out capabilities for “long-tail” language and script representation online. Google Translate already supports over 100 languages, though this is of course nowhere near 7,000. And in late 2022 the company announced its “1,000 Languages Initiative,” a plan to build an AI language model that would drastically increase language support, and it seems to be making progress.
Even this leaves the vast majority of long-tail languages out. But it’s at least much better than the mere 250 or so that Kornai predicted in 2013 would survive online.
And beyond the languages we read and write online, some programmers have gone out of their way to increase the diversity of languages available for writing software itself. Historically, coding has been done in English, no matter where in the world a programmer sits.
Qalb (قلب), meaning “heart,” is an extraordinary example: a programming language that lets coders write in Arabic. As the language’s creator Ramsey Nasser says, “Computer science was born largely in England and software engineering was really fostered in the United States… If we are going to really push for coding literacy, which I do; if we are going to push to teach code around the world, then we have to be aware of what the cultural biases are and what it means for someone who doesn’t share that background to be expected to be able to reason in those languages.”
قلب also looks beautiful. Nasser created the language in such a way that programmers can extend the length of any letter to create visual patterns in the code, something impossible in English-based programming. A sort of code calligraphy, just like Arabic written anywhere else.
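For the curious, here is a rough sketch of how that kind of letter-stretching can work in plain Unicode text. It assumes the elongation is done with the tatweel character (U+0640), which Arabic typography uses to lengthen the connection between joined letters; I can't say whether this is exactly how قلب handles it internally, and the helper names `stretch` and `unstretch` are my own inventions for illustration.

```python
# Illustrative sketch only: assumes elongation via ARABIC TATWEEL (U+0640).
# `stretch` and `unstretch` are hypothetical helpers, not part of Qalb.
TATWEEL = "\u0640"


def stretch(word: str, width: int, position: int = 1) -> str:
    """Pad `word` to `width` code points by inserting tatweels.

    `position` is the number of leading letters kept before the padding.
    It should fall between two letters that join in Arabic script,
    otherwise the result will not render as one connected stroke.
    """
    padding = max(0, width - len(word))
    return word[:position] + TATWEEL * padding + word[position:]


def unstretch(word: str) -> str:
    """Remove all tatweels, recovering the original spelling."""
    return word.replace(TATWEEL, "")


heart = "\u0642\u0644\u0628"  # the word قلب
stretched = stretch(heart, 8)

print(stretched)                    # same word, drawn wider
print(len(heart), len(stretched))   # 3 8
print(unstretch(stretched) == heart)  # True
```

Because the tatweel changes only how a word is drawn and not how it is spelled, a tool can strip it back out before comparing words, which is what makes this sort of visual alignment possible without changing what the text says.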
Wikipedia contains a whole list of non-English coding languages, which is fun to peruse (though I have no idea how commonly any of them are used). And there are even some coding languages not based on any human language at all, like Brainfuck, which is about as hard to comprehend as the name implies.
And Michael Running Wolf is striving to build AI to reclaim native languages. He argues that even as speakers of indigenous languages dwindle, artificial intelligence can be used to preserve them into the future, and even reintegrate them into daily life:
“And what if you could talk to your smart lightbulbs and say, in Lakota or Cheyenne, "Turn on the house lights"? It would make that indigenous language part of your life, rather than the language of ceremony. That's what some of these languages have been relegated to—they’re no longer a day-to-day language except for a small pool of speakers.”
So there’s light at the end of the tunnel for language diversity online, after all — though at times it can look very long, and rather dark.
Song of the Week: Tacocat — Talk