NIST 2009 Open Machine Translation Evaluation (MT09)
Informal System Combination Results

Date of release: Tue Oct 27 15:48:58 2009
Version: mt09_public_v1


The NIST 2009 Open Machine Translation Evaluation (MT09) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT09 evaluation plan.

Informal System Combination was an informal, diagnostic MT09 task, offered after the official evaluation period. Output from several MT09 systems on the Arabic-to-English and Urdu-to-English Current tests was anonymized and provided for system combination purposes. Participants in this category produced new output based on those provided translations.

Scores reported here are limited to primary Informal System Combination submissions.


These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT09 was an evaluation of research algorithms, the MT09 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the equipment, instruments, software or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.

For the reasons above, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.


Evaluation Data

System output for the Informal System Combination track included output of the Arabic-to-English and Urdu-to-English Current tests. Approximately 30% of the test data was designated as a development set for system combination. The remainder of the system output was provided as the test set.

Language Pair       Data Genre   Development Set   Evaluation Set
Arabic-to-English   Newswire     17 documents      42 documents
Arabic-to-English   Web          16 documents      40 documents
Urdu-to-English     Newswire     20 documents      48 documents
Urdu-to-English     Web          48 documents      114 documents
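
The "approximately 30%" development share can be checked against the document counts in the table above. The following is a minimal sketch, assuming the share is measured in documents (the source does not say whether the split was made by documents, segments, or words). The names `DOC_COUNTS`, `dev_share`, and `dev_shares` are illustrative and not part of any NIST tooling.

```python
# Document counts copied from the Evaluation Data table.
# Keys are (language pair, genre); values are (development, evaluation) counts.
DOC_COUNTS = {
    ("Arabic-to-English", "Newswire"): (17, 42),
    ("Arabic-to-English", "Web"): (16, 40),
    ("Urdu-to-English", "Newswire"): (20, 48),
    ("Urdu-to-English", "Web"): (48, 114),
}


def dev_share(dev_docs, eval_docs):
    """Fraction of a row's documents that went to the development set."""
    return dev_docs / (dev_docs + eval_docs)


def dev_shares(counts):
    """Development share for every (language pair, genre) row."""
    return {key: dev_share(d, e) for key, (d, e) in counts.items()}


if __name__ == "__main__":
    # Print one line per table row with the share as a percentage.
    for (pair, genre), share in dev_shares(DOC_COUNTS).items():
        print(f"{pair:18} {genre:9} {share:.1%}")
```

Counted by documents, each row falls between roughly 28% and 30%, which is consistent with the stated split.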

Informal System Combination Results

Arabic-to-English (Table 1)

Site ID   System   BLEU-4 (mteval-v13a)   IBM BLEU (bleu-1.04)   NIST (mteval-v13a)   TER (tercom-0.7.25)   METEOR (meteor-0.7)
Highest individual system score in ISC test set (system with highest BLEU-4 score on Overall data set)

Urdu-to-English (Table 2)

Site ID   System   BLEU-4 (mteval-v13a)   IBM BLEU (bleu-1.04)   NIST (mteval-v13a)   TER (tercom-0.7.25)   METEOR (meteor-0.7)
Highest individual system score in ISC test set (system with highest BLEU-4 score on Overall data set)