The Language Barrier

There are parallels between the myth of the Tower of Babel and the modern linguistic challenge of the internet. Given its linguistic diversity, India has a unique need for translation technology. Bhashini may be the answer.

This article was first published in The Mint. You can read the original at this link.


One of the best-known myths about the origin of languages is the story of the Tower of Babel. It goes something like this:

There was a time when everyone on Earth spoke the same language, and, united in this manner, decided to build a tower so tall it could reach the heavens. They made such rapid progress on the tower that Yahweh, the God of Israel, was said to have remarked, “Indeed the people are one and they all have one language, and this is what they begin to do; now nothing that they propose to do will be withheld from them.”

Yahweh had to stop the construction of the tower or else mortals would think themselves the equal of God. So he created multiple languages, dividing people into different linguistic groups who couldn’t understand a word the others spoke.

Work on the tower came to a grinding halt and everyone drifted away into their own small groups, never to unite again.

An English Internet

Today there are over 7,000 languages in the world and content created in one language is incomprehensible to those who speak another. With the internet becoming the largest store of information on the planet, the fact that the world speaks so many languages is becoming a real challenge.

Even though just 16% of the world’s population speaks English, it accounts for over 60% of the top 10 million websites on the internet. On the other hand, despite the fact that China has the largest number of internet users, just 1.4% of the top 10 million websites use Chinese. Hindi, the third most spoken language in the world

It is possible that all of this is just temporary. The internet was invented in the English-speaking world, so it stands to reason that the majority of its content would be in English. As the internet becomes more linguistically diverse, this is already showing signs of change. In 1996, 80% of internet users spoke English. By 2010, that had dropped to 27.3%. Today, 12 times more people in China and 25 times more people in the Arabic-speaking world use the internet than did in 1996. It seems inevitable that the language of online content will follow suit.

But building a more linguistically representative internet is not the solution we are looking for. That would require us to create new content in a wider range of languages, when what we need is to make all the information that already exists understandable to a wider range of people.

We need translation technology that can consistently (and with a high level of accuracy) ensure that content in one language is understandable to those who speak another, so that it no longer matters what language the content was created in.

India Needs Translation Tech

Nowhere is the need to solve this more urgent than in India. With over 3,000 languages, the only way to ensure that our development goals reach all corners of the country is to ensure that no content is out of someone’s reach solely on account of the language they speak.

Earlier this year, the government of India launched Bhashini, a digital public platform for languages designed to ensure that digital content can be delivered in all Indian languages using artificial intelligence (AI) and allied technologies for speech-to-speech translation. To achieve this at scale, we will need to generate vast training datasets of text and speech in multiple Indian languages.

The Bhashini project has launched Bhasha Daan, a crowdsourcing platform through which volunteers can support the project by translating texts into languages they understand, contributing spoken words in languages they are familiar with, or labelling images in other languages.

As innovative as this is, it is unlikely to be sufficient. To solve a problem of this magnitude, we need to work with large datasets of annotated information that accurately cross-references speech or text in one Indian language with that in several others. This will allow us to build and train AI models to translate quickly.

Training Datasets

One obvious source for this is the archives of All India Radio and Doordarshan, organisations which, for decades, have created content in multiple regional languages at the same time. Past recordings of the daily news alone will give us comparative samples of roughly the same content spoken in many different languages. I have no doubt that many other sources of similar annotated language data exist in the private domain.
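To make the idea concrete, here is a minimal Python sketch of how same-day news bulletins in different languages could be cross-referenced into paired training records. Everything in it is an assumption for illustration: the `pair_bulletins` function, the record fields, and the placeholder transcripts are hypothetical and do not describe how Bhashini or the broadcasters actually store or process their archives.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical archive records: (broadcast date, language code, transcript).
# The transcripts are placeholders, not real broadcast content.
bulletins = [
    ("2022-08-01", "hi", "<Hindi transcript of the evening bulletin>"),
    ("2022-08-01", "ta", "<Tamil transcript of the evening bulletin>"),
    ("2022-08-01", "bn", "<Bengali transcript of the evening bulletin>"),
    ("2022-08-02", "hi", "<Hindi transcript of the evening bulletin>"),
]


def pair_bulletins(records):
    """Group transcripts by date, then pair every two languages from the same day."""
    by_date = defaultdict(dict)
    for date, lang, text in records:
        by_date[date][lang] = text

    pairs = []
    for date, texts in sorted(by_date.items()):
        # A date with only one language produces no pairs.
        for src, tgt in combinations(sorted(texts), 2):
            pairs.append({
                "date": date,
                "src_lang": src,
                "tgt_lang": tgt,
                "src_text": texts[src],
                "tgt_text": texts[tgt],
            })
    return pairs


if __name__ == "__main__":
    for pair in pair_bulletins(bulletins):
        print(pair["date"], pair["src_lang"], "->", pair["tgt_lang"])
```

Pairing by date only yields roughly comparable texts, since each bulletin is written independently. A real pipeline would still need to align individual sentences, and to transcribe the audio in the first place, before the pairs could be used to train a translation model.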

One possible impediment to this is copyright law, which prevents content from being used without the permission of the owner of the work. Several countries (Japan, Germany, Ireland and Estonia, for example) have amended the fair-use provisions of their copyright statutes to create an exception for data analysis, allowing non-commercial use of this data for the purpose of creating training datasets.

India should consider amendments along similar lines in order to incentivise more innovation in the field of language translation.

BabelFish

Douglas Adams, one of my favourite science fiction authors, introduced a typically irreverent literary device into his Hitchhiker’s Guide to the Galaxy series of books to solve the translation problem at a galactic scale. In his imaginary world, everyone has a BabelFish, a symbiotic creature that lives in your ear and translates all communication signals it hears into a language you can understand. As a result, every species, including the most exotic extraterrestrials, can understand each other.

I’ve always thought it would be cool to have a BabelFish in my ear wherever I travel.

Maybe Bhashini will make that possible.