23 Feb 2016

The Challenge of Disproportionate Language Representation Online

The Internet is humanity’s biggest ever repository of information, an enabler of communication between people across the globe and a way to access education, participate and change the world. But over 4 billion people have yet to access this colossal global resource.

According to the UN, a third of a billion users came online for the first time in 2015. But getting the next billion online may be much harder than the previous billion. Poverty, technology, remote populations, and language barriers are serious obstacles to global connectivity and web participation.

Of these factors, the lack of language representation online may be the biggest barrier to improving internet participation.

The Internet remains dominated by English language content and the gulf between the language profile of the web and that of the real world is looking increasingly large.

It’s a situation that’s best summarised with this graph showing how the language split of the web in no way corresponds to the number of speakers of languages in the offline world.

This is an issue potentially more grave than any technological barrier: a fast broadband connection doesn’t mean much if very little content exists in your language.

Getting the next billion online

Growth in access to the Internet is slowing. It’s looking likely that the next big milestone – having 4 billion internet users connected – won’t be achieved before 2020.

That’s partly because the easiest populations have already been reached. The next tranche of humanity to come online will be located in rural and remote areas, will be poorer than those already online, and will speak some of the world’s less widely spoken languages. That places a number of barriers between them and internet participation, of which language representation is only one.

Presently only a small fraction of the world’s 7000+ languages are represented online in any meaningful way: perhaps fewer than 5% of the total. According to UNESCO, only around 300 languages have any real representation online.

But even those few hundred languages that are online are not represented proportionately. Instead a few major languages, including English, Russian and German, dominate. Despite being spoken by the most people, Chinese languages remain in a minority online. Other languages with major population bases, such as Hindi and Arabic, are also not well represented in terms of the amount of online content available in these languages.

Some parts of the Internet are more linguistically varied than others. Google search can handle around 350 of the world’s languages, Facebook is adding more languages to the ones it can accommodate and Twitter more than doubled the number it can accommodate from 2012 to 2015.

These platforms are still only able to feature content in a relatively small number of the world’s languages. Whilst the Internet could potentially connect humanity in ways we’ve never seen before, the relative paucity of linguistic variation on some platforms is a threat to engagement. LinkedIn, one of the world’s major business networking platforms, can only handle about 24 languages. The world’s biggest trade platform, Alibaba, is available in little more than a dozen.

So what would it take to bring the rest of the world online? We’ll overlook the estimated ¾ billion people in the world who are probably functionally illiterate and focus merely on those currently literate. In language terms, there would need to be significant expansion in content in order to make the Internet linguistically relevant for the majority of the world’s literate population.

According to Facebook research from 2015, if we wanted to make Wikipedia available to 80% of the world’s population then the number of languages it is written in needs to increase from 52 to 92. Making the most popular 100,000 pages of Wikipedia available in 40 new languages would obviously represent a huge project.

But the issue is more complex than ‘merely’ translating a hundred thousand Wikipedia pages into 40 new languages. For users in many regions who cannot readily work with Latin characters, the structure of the domain name system (DNS) itself presents challenges. The Internet was not originally designed to be multilingual and DNS was intended only to support the characters a-z, A-Z, 0-9 and the hyphen. Upgrading the DNS to support the hundreds of thousands of characters used in the world’s major languages is no small feat.
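To make the DNS limitation concrete, here is a minimal sketch in Python. It assumes the workaround in common use today, internationalised domain names, where each non-Latin label is converted into an ASCII form beginning with "xn--" so that the DNS itself still only sees letters, digits and hyphens. The function names and the example domains are ours, chosen for illustration, and the sketch relies on Python's built-in "idna" codec, which follows the older IDNA 2003 rules.

```python
import re

# The character set the article describes: a-z, A-Z, 0-9 and the hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")


def is_ldh(label: str) -> bool:
    """Return True if a single domain label uses only the original DNS characters."""
    return LDH_LABEL.fullmatch(label) is not None


def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form (hypothetical helper)."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII-compatible domain name back to Unicode (hypothetical helper)."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Example domains are illustrative only.
    for name in ["bücher.example", "उदाहरण.example"]:
        ascii_name = to_ascii(name)
        labels_ok = all(is_ldh(label) for label in ascii_name.split("."))
        print(name, "->", ascii_name, "| DNS-safe:", labels_ok)
```

The point of the sketch is that the user sees and types a name in their own script, while the conversion to a restricted ASCII form happens behind the scenes. Every browser, handset and application along the way has to support that conversion, which is part of why the upgrade is so demanding.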

Although cheap basic handsets that can access the Internet are becoming more widespread, many of them may not be able to cater to certain languages.

It may also not be cost-effective to cater to smaller language bases, especially those with a poorer population. Even though Hindi is spoken by a quarter of a billion people, many of them with good income levels by emerging market standards, smartphones that can handle Hindi characters have only emerged very recently. What hope then for languages with much smaller, poorer population bases?

Importance of linguistic diversity online

Making online resources available to speakers of all languages is important for many reasons. A major communication resource that’s only available in the larger world languages is likely to endanger smaller, more obscure languages that are already at serious risk of extinction.

According to Ethnologue, an estimated 1,519 world languages are at risk and 915 are classified as ‘dying’. The world loses six languages a year as the last native speakers die out. An internet with narrow linguistic horizons is a threat to linguistic diversity, yet the Internet could help speakers of obscure languages maintain their languages if only it can accommodate them.

Having the Internet available in your mother tongue is also going to be a major driving force for participation online.

Many of the people not currently online may not see the value in internet access if there’s no content available in their native tongue. This is particularly true for poorer communities, where obtaining access may be a serious financial consideration. It’s especially true for less educated or more isolated populations that may only speak the local language.

The availability of resources in your own language is also important to getting to grips with the new technology. According to Iris Orriss, Facebook’s Director of Internationalization and Localization, users are far more likely to complete their registration on a site if the content is available in their native language. That may be because they are more likely to understand instructions to complete the registration process if they are in their own language.

It remains vital that the world’s languages achieve representation online. Whilst this may be easier for the larger minority languages, we’re likely to see it becoming increasingly difficult to add new language populations to the Internet. Many factors will act as barriers to their participation, including technological and financial considerations. What’s clear is that there’s a long way to go before the Internet reflects the true linguistic picture of the human race.



 
 
