Google is currently working on ways to refine how computers read language.
Scientists at the internet service giant are providing a series of tools for researchers to help PCs, tablets and mobile phones tell similar words apart by using the context in which they appear.
A large number of Americans, for instance, pronounce ‘ladder’ and ‘latter’ identically, meaning it can be extremely difficult for a speech recognition system to differentiate between the two from sound alone.
There is a similar problem with keyboard input on mobile devices, especially for input method editor (IME) keyboards. The input patterns for ‘Yankees’ and ‘takes’, for example, look alike as users slide their fingers across the keypad. This makes it more difficult for devices to predict exactly what the user is trying to write.
Technology website eWEEK.com reports that Google is therefore contributing data sets that researchers can use to try to refine how computers read and hear words in a bid to tackle the problem.
One way computers use context is with language models, which assign a probability to a sequence of words.
This is used in predictive keyboards, as well as many other natural language processing applications including speech recognition, machine translation, spelling correction, query suggestions and information retrieval.
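To illustrate how a language model can use context, here is a minimal sketch of a bigram model with add-one smoothing. It is not Google's approach; the toy corpus, the function names and the smoothing choice are all assumptions made for illustration. The model scores two candidate sentences that a speech recogniser might confuse and prefers the one whose word sequence it has seen more often.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their left-hand contexts over whitespace-tokenised text."""
    context_counts = Counter()  # how often each word appears as the previous word
    bigram_counts = Counter()   # how often each (previous, current) pair appears
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, len(vocab)

def sentence_log_prob(sentence, model):
    """Return the log probability of a sentence under the bigram model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting zero probability.
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        log_prob += math.log(numerator / denominator)
    return log_prob

# Hypothetical toy corpus, invented for this example.
TOY_CORPUS = [
    "he climbed the ladder",
    "she climbed the ladder",
    "the ladder fell over",
    "i prefer the latter option",
    "the latter option is cheaper",
]

model = train_bigram_model(TOY_CORPUS)
ladder_score = sentence_log_prob("he climbed the ladder", model)
latter_score = sentence_log_prob("he climbed the latter", model)

# The higher (less negative) score marks the more plausible word sequence.
best = "ladder" if ladder_score > latter_score else "latter"
print(f"ladder: {ladder_score:.3f}  latter: {latter_score:.3f}  preferred: {best}")
```

A real system would train on far more text and use a stronger model, but the principle is the same: the surrounding words shift the probability towards one candidate.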
But error attribution can be complicated when evaluating the quality of such complex systems, because it is hard to tell which component is responsible for a given mistake.
Google believes having a large, standard set of words with benchmarks for easy comparison and experiments with new modelling techniques could be a potential way of improving language modelling for computers.
It is therefore releasing scripts that convert a set of public data into a language modelling data set of over a billion words, with standardised training and test splits. The processed data, including the training and test sets, is also being made available in one convenient location.
The idea is to make it easier for the research community to quickly reproduce results. All the benchmark scripts and data are freely available to all researchers who want to work with the data set.
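One way a standardised split can be made reproducible is to derive each sentence's assignment from a hash of its text, so that anyone running the same script on the same data gets the same training and test sets. The sketch below is an assumption about how this could be done, not a description of Google's released scripts; the function names and the 1% test share are hypothetical.

```python
import hashlib

def assign_split(sentence, test_percent=1):
    """Deterministically assign a sentence to 'train' or 'test'.

    The assignment depends only on the sentence text, so it is identical
    on every machine and every run. test_percent is a hypothetical
    setting chosen for illustration.
    """
    digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "test" if bucket < test_percent else "train"

def split_corpus(sentences, test_percent=1):
    """Partition sentences into training and test lists."""
    splits = {"train": [], "test": []}
    for sentence in sentences:
        splits[assign_split(sentence, test_percent)].append(sentence)
    return splits

# Hypothetical sample sentences, invented for this example.
sample = [
    "he climbed the ladder",
    "i prefer the latter option",
    "the yankees won the game",
    "it takes a long time",
]
splits = split_corpus(sample, test_percent=50)
print(len(splits["train"]), "training sentences,", len(splits["test"]), "test sentences")
```

Because the split does not rely on a random seed or on the order of the input, two research groups working independently would evaluate on exactly the same held-out sentences.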
New and better standard benchmark
Google hopes to create a new and better standard benchmark for language modelling experiments.
Comparisons will be easier and more accurate as more researchers use the new benchmark, while progress will also be faster.
Researchers currently report results on a data set of their choice. This makes results very hard to reproduce, because there is no standard way of preprocessing the data.
Dave Orr, Google Research product manager, and Ciprian Chelba, a Google research scientist, are encouraging researchers to use the new benchmark as they find improved ways to help machines figure out the context of searches and inquiries.