Why humans are crucial to MT quality

Assessing and monitoring the quality of machine translation is a key element of developing a successful machine translation model. A model that produces poor output is simply not useful. But how is that quality actually measured?

The quality of machine translation, and the speed with which it is produced, are the two elements that have kept researchers and providers busy for decades. Quality is measured when MT models are created and trained, but also on an ongoing basis once those models are used for live work in what is called the “production environment”.

Quality assessment and MT

It is crucial to monitor quality regularly in the live environment, as adaptive neural machine translation models constantly evolve. Keeping a close eye on quality makes it possible to spot anomalies and address issues if the model starts producing translations that fall below the desired quality threshold.

Automated scores

To help with this, a number of automated methods are used. These are mathematical algorithms that calculate a score indicating the expected quality in a matter of seconds. Automatic metrics are popular because of their relatively low cost, both financially and in terms of resources, and the speed with which they can be applied.

However, automatic scores such as BLEU and METEOR are not flawless. They provide a good indication of the expected quality, but they do not always align with human perception of it.
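As an illustration, here is a minimal sketch of how a BLEU score can be computed in Python using the sacrebleu library; the hypothesis and reference sentences are invented purely for demonstration.

```python
# pip install sacrebleu
import sacrebleu

# Hypothetical MT output and human reference translations (invented examples).
hypotheses = [
    "The cat sat on the mat.",
    "He bought three apple at the market.",
]
references = [
    "The cat sat on the mat.",
    "He bought three apples at the market.",
]

# Corpus-level BLEU: compares the MT output against one set of references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")  # score out of 100; higher means closer to the reference
```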

Therefore, to get a full picture, it is advisable to combine the power of automated scores with human judgement.

Human assessment

It is important to note that human assessment also has its imperfections. Not only is it time-consuming and costly, but it also does not always provide a crystal clear picture.

Because translation is an inherently subjective matter, if you gave the same machine-translated text to 10 different linguists, their scores would very likely vary widely rather than converge on a single value.
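One simple way to quantify that disagreement is to look at the spread of the scores. The sketch below uses Python’s standard statistics module; the ten ratings are invented for the example.

```python
import statistics

# Hypothetical 1-10 quality ratings from ten linguists for the same MT output (invented values).
ratings = [6, 8, 5, 7, 9, 6, 4, 7, 8, 5]

mean = statistics.mean(ratings)
spread = statistics.stdev(ratings)  # sample standard deviation

print(f"Mean rating: {mean:.1f}")
print(f"Standard deviation: {spread:.1f}")  # a large spread signals low inter-annotator agreement
```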

Still, human assessment is a valid and relevant means of gaining insight into MT systems and monitoring the quality of their output.

Methods of human assessment

Let’s take a look at some of the most common types of human evaluation of machine translation.

A frequently used way for linguists to assess MT is by assigning a rating. Human evaluators are asked to rate translations using a pre-determined scale. This can be a scale of 1 to 10 or a percentage, where the lowest point on the scale indicates very poor quality and the highest excellent or flawless quality.

The criteria need to be specified very clearly: for instance, whether the score should reflect all linguistic qualities, such as grammar, punctuation, accuracy and fluency, or only some of them, accuracy and adequacy for example.

There are several methods of assessing machine translation quality with the help of human experts.

Another way in which human evaluators assess machine translation is by judging its adequacy – how much of the meaning expressed in the source text is retained in the target text.

In this case, the annotator has to be fluent in both source and target language to accurately judge the messaging retained. The scale applied normally ranges from “all meaning retained” through to “none of the meaning retained”.

A different way of assessing machine translation output is to look at its fluency. This method differs from assessing adequacy in that it focuses only on the target text; therefore the annotator does not need to understand the source language.

The annotator is presented with the question “Is the language in the target text fluent?” and the scale will typically range from “flawless” to “incomprehensible”.

The quality assessment may focus only on certain aspects of the translation such as fluency or adequacy.

Error analysis is one of the most comprehensive means of human evaluation. Evaluators identify errors and indicate the category of each one. The error typology may depend on the specific language as well as the content type.

Each error category will have a different severity assigned to it. This means that some errors might be perceived as not impacting the meaning, while others will distort the message that is meant to be conveyed and therefore carry high severity.

Some examples of error categories would be “grammar – wrong subject-verb agreement”, “grammar – incorrect word order”, “punctuation – missing full stop”, “accuracy – omitted word”.
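As a rough illustration, the sketch below shows how such an error typology could be weighted to produce a single penalty score. The categories, severity weights and logged errors are all invented for the example rather than taken from any standard scheme.

```python
# Hypothetical severity weights per error category (invented values).
severity_weights = {
    "grammar - wrong subject-verb agreement": 3,
    "grammar - incorrect word order": 3,
    "punctuation - missing full stop": 1,
    "accuracy - omitted word": 5,
}

# Errors an evaluator might have logged for one translated segment (invented).
logged_errors = [
    "accuracy - omitted word",
    "punctuation - missing full stop",
]

# A simple penalty: the sum of severity weights for all logged errors.
penalty = sum(severity_weights[error] for error in logged_errors)
print(f"Total error penalty: {penalty}")  # higher means lower quality
```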

The final human evaluation method is ranking, where evaluators are given two or more translation options and are asked to pick the best one. This can sometimes prove difficult if there are only very slight differences between the options, or if they contain errors that are hard to compare.

In those instances, it can help if annotators consider which translation contains the errors with the greater impact on overall quality.
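For example, pairwise ranking judgements from several annotators can be tallied to pick an overall winner. A minimal sketch, with invented system names and votes:

```python
from collections import Counter

# Hypothetical pairwise judgements: each annotator picks the better of two MT outputs (invented).
votes = ["system_A", "system_B", "system_A", "system_A", "system_B"]

tally = Counter(votes)
winner, count = tally.most_common(1)[0]
print(f"Preferred output: {winner} ({count} of {len(votes)} votes)")
```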

Human evaluation of machine translation, in conjunction with automated measures, provides an important insight into the performance of MT systems.

This is another way in which linguists can diversify their services: they can become machine translation evaluators and thereby have a significant impact on what machines output. Without linguists, perfecting machine translation would be a much more difficult task.
