Assessing and monitoring the quality of machine translation is a key element of developing a successful machine translation model. A model that produces poor-quality output is simply not useful. But given its importance, how is that quality measured?
The quality of machine translation and the speed with which it is produced are the two elements that have kept researchers and providers busy for decades. Quality is measured while MT models are being created and trained, but also on an ongoing basis once those models are used for live work in what is called a “production environment”.
Quality assessment and MT
It is crucial to monitor quality closely and regularly in the live environment, as adaptive neural machine translation models evolve constantly. Keeping a close eye on quality allows teams to spot anomalies and address issues if a model starts producing translations that fall below the desired quality threshold.
To help with that, a number of automated methods are in use. These are mathematical algorithms that calculate a score indicating the expected quality in a matter of seconds. Automatic methods are popular because of their relatively low cost, both financially and in terms of resources, and the speed with which they can be applied.
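To give a flavour of how such algorithmic scores work, here is a minimal sketch of one of the simplest ideas: comparing the words in a machine translation against a human reference translation. Real metrics such as BLEU, chrF or COMET are considerably more sophisticated; the function name and example sentences below are illustrative only.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate words that
    also appear in the reference, counting each reference word at most
    as many times as it occurs there. A toy stand-in for real metrics."""
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_words)
    # Clip each word's matches so repeating a word cannot inflate the score.
    matched = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return matched / len(cand_words)

score = unigram_precision("the cat sat on a mat",
                          "the cat sat on the mat")
```

Five of the six candidate words match the reference, so the score is 5/6. The appeal of this family of methods is exactly what the article describes: the computation is cheap and takes a fraction of a second per sentence.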
Therefore, to get a full picture, it is advisable to combine the power of automated scores with human judgement.
It is important to note that human assessment also has its imperfections. Not only is it time-consuming and costly, but it also does not always provide a crystal-clear picture.
Because translation is an incredibly subjective matter, it is very likely that if you gave the same machine-translated text to 10 different linguists, the scores they gave would vary widely rather than converge on a single point.
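That disagreement can itself be quantified. The sketch below uses hypothetical scores from ten linguists and Python's standard library to report both the average score and how far the evaluators are spread around it.

```python
from statistics import mean, stdev

# Hypothetical 1-10 ratings from ten linguists for the same MT output.
scores = [7, 5, 8, 6, 9, 4, 7, 6, 8, 5]

average = mean(scores)   # the central tendency of the panel
spread = stdev(scores)   # how much the evaluators disagree
```

A large standard deviation relative to the scale is a signal that the evaluation guidelines may need tightening, or that more evaluators are needed before trusting the average.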
Still, human assessment is a valid and relevant means of gaining knowledge of MT systems and monitoring the quality that they output.
Methods of human assessment
Let’s take a look at some of the most common types of human evaluation of machine translation.
A frequently used way for linguists to assess MT is by assigning ratings. Human evaluators are asked to rate translations on a pre-determined scale. This can be a scale of 1 to 10, or a percentage, where the lowest point indicates very poor quality and the highest excellent or flawless quality.
The criteria need to be specified very clearly: for instance, whether the score should reflect all linguistic qualities, such as grammar, punctuation, accuracy and fluency, or only some of them, accuracy and adequacy for example.
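One common way to organise such a rubric is to score each criterion separately and then combine the scores. The criteria names and equal weighting below are assumptions for illustration; a real evaluation brief would define its own.

```python
# Hypothetical per-criterion ratings on a 1-10 scale for one segment.
segment_scores = {
    "accuracy": 8,
    "fluency": 7,
    "grammar": 9,
    "punctuation": 10,
}

# Equal-weight overall score; a real rubric might weight accuracy
# more heavily than punctuation, for example.
overall = sum(segment_scores.values()) / len(segment_scores)
```

Scoring per criterion rather than giving one holistic number makes the evaluators' task more concrete and tends to reduce the subjectivity discussed above.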
Another way in which human evaluators assess machine translation is by judging its adequacy – how much of the meaning expressed in the source text is retained in the target text.
In this case, the annotator has to be fluent in both the source and target languages to accurately judge how much of the message is retained. The scale applied normally ranges from “all meaning retained” through to “none of the meaning retained”.
A different way of assessing machine translation output is looking at its fluency. This method differs from assessing adequacy in that it focuses only on the target text; the annotator therefore does not need to understand the source language.
The annotator is presented with the question “Is the language in the target text fluent?”, and the scale will typically range from “flawless” to “incomprehensible”.
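Because both adequacy and fluency are collected as labels on an ordinal scale, a common practical step is to map the labels to numbers so that judgements can be averaged and compared across systems. The intermediate labels below are invented for illustration; only the endpoints come from the scales described above.

```python
# Hypothetical numeric mappings for the ordinal scales. The endpoint
# labels follow the article; the middle labels are assumptions.
ADEQUACY = {
    "all meaning retained": 4,
    "most meaning retained": 3,
    "some meaning retained": 2,
    "none of the meaning retained": 1,
}
FLUENCY = {
    "flawless": 4,
    "good": 3,
    "disfluent": 2,
    "incomprehensible": 1,
}

# Three annotators' adequacy judgements for the same segment.
judgements = ["all meaning retained",
              "most meaning retained",
              "most meaning retained"]
mean_adequacy = sum(ADEQUACY[j] for j in judgements) / len(judgements)
```

Treating ordinal labels as numbers is a simplification, but it is a widespread and convenient one when comparing MT systems at scale.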
Error analysis is one of the most comprehensive means of human evaluation. Evaluators identify errors and indicate the category of each error. The error typology might depend on a specific language as well as content type.
Each error category will have a different severity assigned to it. This means that some errors might be perceived as not affecting the meaning, while others distort the message that is meant to be conveyed and are therefore assigned high severity.
Some examples of error categories would be “grammar – wrong subject-verb agreement”, “grammar – incorrect word order”, “punctuation – missing full stop”, “accuracy – omitted word”.
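Combining categories and severities typically yields a single penalty score per segment. The sketch below, loosely inspired by frameworks such as MQM, uses assumed severity labels and weights; the error descriptions are the ones listed above.

```python
# Hypothetical severity weights; real typologies define their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

# Errors an evaluator logged for one translated segment,
# using the example categories from the article.
errors = [
    ("grammar - wrong subject-verb agreement", "major"),
    ("punctuation - missing full stop", "minor"),
    ("accuracy - omitted word", "critical"),
]

# Total penalty: the higher the number, the worse the segment.
penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
```

Weighting by severity captures the point made above: a missing full stop and an omitted word are both errors, but they should not count equally against the translation.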
The final human evaluation method is ranking, where evaluators are given two or more translation options and are asked to pick the best option. This can sometimes prove difficult if there are only very slight differences between the translation options or if they contain errors that are difficult to compare.
In those instances, it can help annotators to decide which translation contains the errors with the greater impact on translation quality.
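Ranking judgements from many annotators are usually aggregated by counting how often each system's output was preferred. The system names and preference data below are hypothetical.

```python
from collections import Counter

# Hypothetical pairwise judgements: each entry records which system's
# output an evaluator preferred when shown candidates from A and B.
preferences = ["A", "B", "A", "A", "B", "A"]

wins = Counter(preferences)
best_system = wins.most_common(1)[0][0]
```

Simple win counts like this suffice for two systems; with more systems, pairwise preferences are often fed into a ranking model instead, but the principle of aggregating many individual choices is the same.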
Human evaluation of machine translation, in conjunction with automated measures, provides an important insight into the performance of MT systems.
This is another way in which linguists can diversify their services – they can become machine translation evaluators and therefore have a significant impact on what machines will output. Without linguists, perfecting machine translation would be a much more difficult task.