Prevailing wisdom states that 70-80% of the web’s content is in English, but the linguistic breakdown of the internet is surprisingly hard to verify. Many of the early studies were based on random page sampling, a method that is less reliable now that a single massive social media site such as Facebook can span many languages.
Other efforts to categorize the language breakdown of the web have focused on counting the instances of unique words in different languages being used in web content. This method looks at how many times a single word is used in its English, French and Chinese versions across the web.
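As a rough illustration of the word-counting idea, the sketch below turns per-language hit counts for a handful of words into an estimated share for each language. It is a minimal sketch under stated assumptions rather than the method any particular study used: the function name `language_shares`, the sample words and all of the counts are made up for illustration, and a real study would also need to correct for how common each word is within its own language.

```python
from statistics import mean


def language_shares(hits_by_word):
    """Estimate each language's share of content from word hit counts.

    hits_by_word maps a concept (e.g. "water") to a dict of
    {language: number of pages containing that word's translation}.
    Returns {language: average share across all concepts}.
    """
    per_word_shares = []
    for counts in hits_by_word.values():
        total = sum(counts.values())
        if total == 0:
            continue  # skip words with no hits at all
        per_word_shares.append({lang: n / total for lang, n in counts.items()})

    languages = {lang for shares in per_word_shares for lang in shares}
    # Average the per-word shares so that no single word dominates.
    return {
        lang: mean(shares.get(lang, 0.0) for shares in per_word_shares)
        for lang in sorted(languages)
    }


# Placeholder counts, invented purely for illustration.
sample = {
    "water": {"English": 600, "French": 150, "Chinese": 250},
    "house": {"English": 400, "French": 200, "Chinese": 400},
}

shares = language_shares(sample)
for lang, share in shares.items():
    print(f"{lang}: {share:.1%}")
```

Averaging over several words, rather than relying on one, is a design choice made here to smooth out words that happen to be unusually common or rare in a given language.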
When this method was first employed in the nineties, it suggested that 80% of web content was in English. Ongoing research using the same method has shown a continuous fall in that proportion. By 2005 only 45% of content was thought to be in English, and the current estimate is under 40%. Considering that only around 5% of the world’s population are native English speakers, with around 20% thought to have some competency in the language, there may still be some way to go before the internet more accurately reflects the world’s linguistic variation.
A trend has been identified showing that web usage grows fastest in countries where English is not the dominant language. By 2010, only just over a quarter of web users were native speakers of English – compared to over 80% in 1996. There are several factors behind this shift in the linguistic profile of web content. The rise of user-generated content has probably played a part in expanding linguistic diversity. Whilst users might be prepared to interact with sites written in other languages, when it comes to generating content they mostly want to do so in their own mother tongue.
Dramatic growth in internet access for speakers of languages such as Arabic and Chinese will also mean English is no longer the dominant language it once was online. There is plenty of progress still to be made in bringing emerging markets online, and many of their citizens will speak languages other than English. This represents a huge linguistic group still to opt in to the world wide web and bring their own language needs with them.
It’s clear that there is still some way to go before the world’s linguistic diversity is properly reflected online, with even some of the world’s most commonly spoken languages still poorly catered for. Widely spoken languages such as Arabic and Hindi still account for only a very small proportion of online content. Only around a quarter of Malay speakers are thought to have internet access, despite Malay being one of the world’s most widely spoken languages.
How can we bring more languages online?
So what needs to happen to effect change? Achieving greater linguistic diversity online is probably going to require more than just enabling access for speakers of languages such as Malay. It will also be necessary to move away from the present situation where content creation is centralized both geographically and linguistically.
Because developing localised content is costly and carries significant risk, a new model may need to evolve to find ways to distribute and monetize this new diversity of content.
Search technology will also need to adapt to the new linguistic profile of the internet. It’s already difficult for search engines to fully index social media networks, meaning some web content remains invisible. The content they do manage to index tends to skew towards English, partly because English content is more profitable from an advertising perspective. This may be another challenge to overcome when new language groups need to be served their own content.
Some of the larger multinational platforms are making efforts to expand their offering to target the bigger languages. Google is actively targeting speakers of Indian languages, in particular Hindi, in the hope of capturing this emerging market in its early stages. This kind of drive to incorporate new language groups is clearly only going to be worthwhile for larger language markets. Google believes that its activities in India could reach 500 million internet users from an emerging middle class with potential spending power, so it’s obviously worth the investment to target the most widely spoken Indian languages.
Facebook has also expanded the number of languages it caters for; at present it can handle about 70 of the world’s 7000 languages. To expand its language options, the social media giant has opened its translation application to volunteer translators. In theory this translation model allows it to add new languages quickly. In practice, however, the number of new language options being offered is limited. It isn’t clear what Facebook’s plans are to expand its language offerings further.
In any case, existing players may not be best placed to expand their services to encompass other languages. The Czech Republic’s native search platform, Seznam, claims that its local knowledge is what makes it possible to compete successfully against the global giant. Seznam offers features important to local users, such as daily updated local maps, which its giant competitor is unable to offer to a market that size. Seznam has 1000 employees, and it may be operating in a market large enough for a native operator to thrive but too small for a larger multinational to invest much energy in challenging.
This raises the ugly question of whether only profitable language groups are going to be catered for with content in their own language online. That is a real risk unless we find new ways to create, distribute and monetize content for smaller linguistic populations.