
Correlation Results

Human Assessment Type: Adequacy, 7-point scale, straight average

Judges were presented with a reference translation and a candidate sentence to evaluate. They answered the following "quantitative" question: "How much of the meaning expressed in the Reference translation is also expressed in the System translation?" on a 7-point scale ranging from 1 (None) to 7 (All).
Each segment assessed received at least two judgments from two different judges.
The segment score is the average of all (two or more) scores given on this segment. The document or system score is computed as the weighted (by segment length) average of segment scores.
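
The averaging described above can be sketched in Python as follows. This is a minimal illustration, not the evaluation's actual tooling: the function names are hypothetical, and segment length is assumed to be a word count, which the description does not specify.

```python
from statistics import mean


def segment_score(judgments):
    """Straight average of all judgments (1-7) given on one segment."""
    if len(judgments) < 2:
        raise ValueError("each segment needs at least two judgments")
    return mean(judgments)


def weighted_score(segments):
    """Length-weighted average of segment scores.

    `segments` is a list of (judgments, length) pairs; `length` is assumed
    to be the segment's word count (an assumption, not stated in the text).
    """
    total_length = sum(length for _, length in segments)
    weighted_sum = sum(segment_score(j) * length for j, length in segments)
    return weighted_sum / total_length


# Hypothetical document with two segments of 10 and 30 words.
document = [([6, 7], 10), ([3, 4], 30)]
print(weighted_score(document))
```

Because the longer segment carries three times the weight, the document score sits closer to that segment's average than a plain mean of the two segment scores would.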

Target Language: English

Human Assessment Type: Adequacy, Yes-No qualitative question, proportion of Yes assigned

The 7-point scale adequacy question described above was followed by a second, more "qualitative" question: "Does the Machine translation mean essentially the same as the Reference translation?" Judges did not have to answer this binary Yes/No question if their answer to the preceding question was 4 (Half) or less; in that case the answer was considered to be 'No' by default.
Each segment assessed received at least two judgments from two different judges.
The score is the number of 'Yes' assigned divided by the total number of judgments. The proportion is computed identically for segment, document and system level scores.
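
A small Python sketch of this proportion, including the default 'No' rule, might look like the following. The function names and the `(adequacy, answer)` record layout are hypothetical, chosen only for illustration.

```python
def yes_no_answer(adequacy, answer=None):
    """Resolve one judgment to True ('Yes') or False ('No').

    `adequacy` is the judge's 7-point score; `answer` is the explicit
    Yes/No answer, or None if the judge skipped the question.
    """
    if answer is None:
        if adequacy <= 4:
            return False  # skipped with adequacy of 4 (Half) or less: 'No' by default
        raise ValueError("an explicit answer is required when adequacy is above 4")
    return answer


def yes_proportion(judgments):
    """Number of 'Yes' answers divided by the total number of judgments."""
    answers = [yes_no_answer(adequacy, answer) for adequacy, answer in judgments]
    return sum(answers) / len(answers)


# Hypothetical judgments: two 'Yes', one skipped (default 'No'), one explicit 'No'.
judgments = [(6, True), (3, None), (5, False), (7, True)]
print(yes_proportion(judgments))
```

The same function applies at segment, document, or system level; only the set of judgments passed in changes.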

Target Language: English

Human Assessment Type: Preferences, Pair-wise comparison across systems

Two candidate translations of the same segment from two different systems were presented to a judge, along with a reference translation. The judge decided which candidate translation he or she preferred, with 'No preference' available as a third choice.
A full pair-wise comparison across systems was performed on a selected number of segments.
The segment score represents the number of times the given system segment was preferred, divided by the number of judgments involving this same system segment. The proportion is computed identically for segment, document, or system-level scores.
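
The preference proportion can be sketched in Python as below. The record layout and function name are hypothetical. The sketch also assumes that a 'No preference' judgment counts in the denominator for both systems involved but in neither numerator, which is one reading of the description above.

```python
from collections import Counter


def preference_scores(judgments):
    """Fraction of judgments in which each system was preferred.

    Each judgment is (system_a, system_b, winner), where `winner` is one of
    the two system names or None for 'No preference'.
    """
    preferred = Counter()
    involved = Counter()
    for system_a, system_b, winner in judgments:
        if winner not in (system_a, system_b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared systems")
        involved[system_a] += 1
        involved[system_b] += 1
        if winner is not None:
            preferred[winner] += 1
    return {system: preferred[system] / involved[system] for system in involved}


# Hypothetical judgments over three systems on one segment.
judgments = [
    ("A", "B", "A"),
    ("A", "C", None),  # 'No preference'
    ("B", "C", "C"),
    ("A", "B", "B"),
]
print(preference_scores(judgments))
```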

Target Language: English

Human Assessment Type: Adjusted Probability that a Concept is Correct

Low-level concepts in the source text were identified beforehand. Several bilingual judges (5 for Farsi, 6 for Arabic) then looked for these concepts in the candidate sentences, comparing each candidate against the annotated source sentence and marking deletions, substitutions, and insertions.
A segment score is the number of correctly conveyed concepts, divided by the total number of concepts identified in the source sentence (including concepts identified by the judges as inserted concepts). Measures are aggregated for document and system scores.
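
As a rough Python sketch of the segment score for a single judge's annotation: the function name is hypothetical, and it assumes that a concept counts as correctly conveyed when it is neither deleted nor substituted. How the judges' annotations are combined, and the exact nature of the adjustment, are not specified here and are left out.

```python
def concept_score(source_concepts, deletions, substitutions, insertions):
    """Correctly conveyed concepts divided by all concepts considered.

    Assumption: correct = source concepts that were neither deleted nor
    substituted. The denominator adds the concepts marked as inserted.
    """
    correct = source_concepts - deletions - substitutions
    if correct < 0:
        raise ValueError("more deletions and substitutions than source concepts")
    total = source_concepts + insertions
    return correct / total


# Hypothetical annotation: 10 source concepts, 1 deleted, 2 substituted, 2 inserted.
print(concept_score(10, 1, 2, 2))
```

Counting inserted concepts in the denominator means a candidate that adds spurious content is penalized even if every source concept is conveyed.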

Target Language: English

Human Assessment Type: Adequacy, 4-point scale

A set of bilingual judges (5 for Farsi, 6 for Arabic) graded the adequacy of a candidate translation by comparing it to the source sentence in a two-step process. They first judged the candidate as more adequate or more inadequate, then refined that choice: completely adequate or tending adequate in the first case, inadequate or tending inadequate in the second. Possible scores thus range from 1 (Inadequate) to 4 (Completely adequate).
A segment score is the average of all scores given on this segment. A document or system score is the weighted (by segment length) average of segment scores.
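
The two-step decision can be mapped to the 4-point scale as in the Python sketch below. The function name is hypothetical, and the intermediate values (2 for tending inadequate, 3 for tending adequate) are inferred from the stated endpoints rather than given explicitly.

```python
from statistics import mean


def four_point_score(adequate, tending):
    """Map the two-step judgment to a score from 1 to 4.

    `adequate`: first step, True if judged more adequate than inadequate.
    `tending`: second step, True if only 'tending' toward that side.
    """
    if adequate:
        return 3 if tending else 4  # tending adequate / completely adequate
    return 2 if tending else 1      # tending inadequate / inadequate


# Hypothetical judgments from five judges on one segment.
decisions = [(True, False), (True, True), (True, True), (False, True), (True, False)]
scores = [four_point_score(adequate, tending) for adequate, tending in decisions]
print(scores, mean(scores))
```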

Target Language: English

Human Assessment Type: Adequacy, 5-point scale

Judges evaluated the adequacy of Arabic-to-French and English-to-French translations. Each segment received one judgment, made on a 5-point scale. Document and system level scores are the weighted (by segment length) averages of segment scores.

Target Language: French

Human Assessment Type: Fluency, 5-point scale

Judges evaluated the fluency of Arabic-to-French and English-to-French translations. Each segment received one judgment, made on a 5-point scale. Document and system level scores are the weighted (by segment length) averages of segment scores.

Target Language: French

Human Assessment Type: HTER

A human annotator modifies a candidate translation so that it has the same meaning as a reference translation. Emphasis is on as few edits as possible to achieve the same meaning. Then, this modified text is used as a single reference to compute the TER score for the candidate sentence. Document and system level scores are computed as weighted (by segment length) averages of segment scores.
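
To illustrate the idea, the Python sketch below approximates the segment score as a word-level edit distance between the candidate and its human-edited version, divided by the length of the edited text. This is a simplification: real TER also counts block shifts as single edits, which this sketch omits, so it can overestimate the score when words are reordered. The function names are hypothetical, and whitespace tokenization is an assumption.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insertions, deletions, substitutions)."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a candidate word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def hter_approx(candidate, post_edited):
    """Edits needed to turn the candidate into the post-edited text, per reference word."""
    hyp = candidate.split()
    ref = post_edited.split()
    if not ref:
        raise ValueError("post-edited reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


# Hypothetical segment: the annotator inserted one word.
print(hter_approx("the cat sat on mat", "the cat sat on the mat"))
```

A score of 0 means the annotator changed nothing; higher values mean more editing was needed to match the reference meaning.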

Target Language: English