NIST MT08 Online Evaluation Platform

Guidelines for making adequacy judgments

The Adequacy assessments are concerned with how well the meaning of a correct human Reference translation is preserved in a Machine translation of the same sentence. We are asking for a two-part adequacy judgment: The first part is of a more quantitative nature regarding how much of the information from a Reference translation is preserved in a Machine translation. The second part asks for a more global qualitative judgment of the Machine translation.

Adequacy Judgment, Part 1/2: Quantitative Judgment

You will be presented with a Reference translation of a sentence followed by a Machine translation of the same sentence.

Overlap of words or word sequences between the Reference translation and the Machine translation will be highlighted. It is up to you whether to make use of the highlighting as a visual aid for making your judgment.

Judge the adequacy of the Machine translation compared to the Reference translation by answering the question:

How much of the meaning expressed in the Reference translation is also expressed in the System translation?

Select one of the seven scale points ranging from None to All.

Example:

Reference translation:

Erdogan Confirms Turkey Will Resist Any Pressure to Recognize Cyprus

System translation:

Ardogan Stresses That Would Reject Any Pressure on Turkey to Urge Recognition of Cyprus

How much of the meaning expressed in the Reference translation is also expressed in the System translation?

All
Half
None


Give an intuitive judgment as much as possible; do not spend a long time pondering your decision.

We are looking for an estimate rather than an exact count of elements of meaning.

You should complete each of these judgments in less than 30 seconds.

When determining how much meaning is preserved, give more weight to the more important parts of the sentence than to the less important ones. For example, a missing article is likely to be less important, and thus takes away less of the meaning, than a missing main verb or proper name.

If you find that you would like to give a judgment that would lie between two scale points, pick the scale point that is closest. For example, if you think that almost all of the meaning is expressed, you may decide to give a score of All. Similarly, if almost none of the meaning is expressed, you may decide to give a score of None.
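For readers who process these judgments programmatically, the "pick the closest scale point" rule can be sketched as follows. This is only an illustration: the numeric coding of the scale (0 for None, 3 for Half, 6 for All, evenly spaced) and the function name `nearest_scale_point` are assumptions, not part of the evaluation platform, and assessors are still expected to judge intuitively rather than compute a fraction.

```python
# Assumed coding of the seven scale points: 0 = None, 3 = Half, 6 = All.
SCALE_MAX = 6


def nearest_scale_point(fraction_preserved: float) -> int:
    """Map an estimated fraction of preserved meaning (0.0 to 1.0)
    to the closest of the seven scale points."""
    if not 0.0 <= fraction_preserved <= 1.0:
        raise ValueError("fraction_preserved must be between 0.0 and 1.0")
    # int(x + 0.5) rounds to the nearest integer with halves rounded up,
    # which avoids the round-half-to-even behaviour of Python's round().
    return int(fraction_preserved * SCALE_MAX + 0.5)
```

Under this coding, an estimate of "almost all" (say 0.95) lands on All, and "almost none" (say 0.05) lands on None, matching the guidance above.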

Also, if parts of the meaning are present, but the output as a whole is completely unintelligible to you, a score of None would be appropriate.

Sometimes a translation may contain untranslated words from the source language. This often means that the meaning those words would have conveyed is missing from the English output. Treat such untranslated strings as though they were not present, and penalize any missing meaning as you normally would.

Note that we are not asking you to judge the fluency of the machine translation. If the grammatical structure of a Machine translation is awkward or even wrong, but the meaning can still be understood, the ungrammaticality should not influence your judgment.

Adequacy Judgment, Part 2/2: Qualitative Judgment

If your score is higher than Half, you will be asked to make an additional judgment about the sentence by selecting Yes or No for the question:

Does the Machine translation mean essentially the same as the Reference translation?

Example:

Reference translation:

Erdogan Confirms Turkey Will Resist Any Pressure to Recognize Cyprus

System translation:

Ardogan Stresses That Would Reject Any Pressure on Turkey to Urge Recognition of Cyprus

How much of the meaning expressed in the Reference translation is also expressed in the System translation?

All
Half
None

Does the Machine translation mean essentially the same as the Reference translation?

Yes
No


Note that the word essentially indicates that minor aspects of meaning may be altered, and the answer to the question can still be Yes if those aspects are not of critical importance. It is helpful to remember that different aspects of a sentence differ in how likely they are to affect meaning.

For example, a missing or incorrect article is not very likely to affect the meaning enough to count as a change in essential meaning. A missing or added negation marker or a wrong main verb, on the other hand, is almost certain to change the essential meaning. Changes in word order may or may not affect whether the meaning is still essentially the same.

Note that in some cases you may give a sentence a high score on the first question but still select No for the second question.

This is likely if the Machine translation misses only a small but very important piece of meaning (such as a negation element), or has all the information from the Reference translation but also adds information that the Reference translation does not have.

To guide your decision on this assessment, you can ask yourself the question:

If I only had the Machine translation, would I still arrive at the same understanding of the sentence that the Reference translation conveys?

If you gave a score of Half or lower on the original question, you do not answer the additional binary question.
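The relationship between the two parts of the judgment can be summarized in a short sketch. It assumes the seven scale points are coded 0 (None) through 6 (All), with Half at 3; this coding and the names `needs_qualitative_judgment` and `record_judgment` are hypothetical and do not describe how the platform itself is implemented.

```python
from typing import Optional

# Assumed numeric positions on the seven-point scale.
NONE, HALF, ALL = 0, 3, 6


def needs_qualitative_judgment(score: int) -> bool:
    """Part 2 (Yes/No) is asked only when the Part 1 score is above Half."""
    if not NONE <= score <= ALL:
        raise ValueError("score must be one of the seven scale points (0-6)")
    return score > HALF


def record_judgment(score: int, same_meaning: Optional[bool] = None) -> dict:
    """Combine both parts into one record, enforcing the gating rule."""
    if needs_qualitative_judgment(score):
        if same_meaning is None:
            raise ValueError("a Yes/No answer is required for scores above Half")
    elif same_meaning is not None:
        raise ValueError("no Yes/No answer is collected for scores of Half or lower")
    return {"score": score, "same_meaning": same_meaning}
```

For example, `record_judgment(5, False)` represents a high quantitative score combined with a No on the qualitative question, the case described above for a missing negation or added information.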

After making your selection(s), click Proceed to next segment to go to the next item.

On the example page, you will find a list of examples of sentences falling under the different assessment categories, along with some explanation of why they would best fit under that category. Not every aspect leading to a certain score is pointed out for every example.

Keep in mind that these are only guidelines and cover only a small sample of the kinds of Machine translation output quality that you will see. Also, remember that there is not always one correct score to assign. In several of the examples, one could reasonably argue for a slightly different score. Use these guidelines to help you form your best judgment, but not to dictate your judgment.
