NIST 2009 Open Machine Translation Evaluation (MT09)
Official Release of Results

Date of release: Tue Oct 27 15:48:58 2009
Version: mt09_public_v1

The NIST 2009 Open Machine Translation Evaluation (MT09) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations to support machine translation (MT) research and help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT09 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, nor as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT09 was an evaluation of research algorithms, the MT09 test design required local implementation by each participant. As such, participants were required only to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the equipment, instruments, software or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes to the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product-testing exercise, and the results should not be used to draw conclusions about which commercial products are best suited to a particular application.

History

Evaluation Tasks

MT09 was a test of text-to-text MT technology. The evaluation consisted of three tasks, differing only by the source language processed:

  • Arabic-to-English
  • Chinese-to-English
  • Urdu-to-English

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differ solely by the amount of data that was available for use in the training and development of the core MT engine. These evaluation conditions were called "Constrained Training" and "Unconstrained Training". See the evaluation specification document for a complete description of allowable resources for each.

Evaluation Tracks

In recent years, performance improvements have been demonstrated through the use of system combination techniques. For MT09, two evaluation tracks were therefore supported, called the "Single System Track" and the "System Combination Track"; results are reported separately for each. As the track names imply, translations entered in the Single System Track are produced primarily by one algorithmic approach, while translations in the System Combination Track result from a combination technique that draws on two or more core algorithmic approaches.
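As an illustration of the general idea only (not any participant's actual method), the sketch below performs a simple consensus-based hypothesis selection: given one output per system, it returns the hypothesis most similar, on average, to the others. Real combination entries typically use richer techniques such as confusion-network decoding; the similarity function and example sentences here are purely hypothetical.

    from collections import Counter

    def consensus_pick(hypotheses):
        # Return the hypothesis (token list) with the highest average
        # unigram-overlap F1 against the other systems' outputs -- a
        # simple MBR-style consensus, shown only for illustration.
        def f1(a, b):
            overlap = sum((Counter(a) & Counter(b)).values())
            if overlap == 0:
                return 0.0
            p, r = overlap / len(a), overlap / len(b)
            return 2 * p * r / (p + r)
        return max(hypotheses,
                   key=lambda h: sum(f1(h, o) for o in hypotheses if o is not h))

    # Hypothetical outputs from three systems for the same source sentence:
    systems = [
        "the president met the delegation".split(),
        "president met with the delegation".split(),
        "the president meets a delegation".split(),
    ]
    print(" ".join(consensus_pick(systems)))  # -> "the president met the delegation"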

Evaluation Data

The following table shows the approximate source word count for each language pair and data genre, separately for the Current Test Set and the Progress Test Set. For the Chinese-to-English language pair, a Chinese word is counted as 1.5 characters on average (see the conversion sketch after the table).

Language Pair        Data Genre   Current Test Set            Progress Test Set
Arabic-to-English    Newswire     16K words (68 documents)    20K words (81 documents)
Arabic-to-English    Web          15K words (67 documents)    15K words (51 documents)
Chinese-to-English   Newswire     -                           20K words (82 documents)
Chinese-to-English   Web          -                           15K words (40 documents)
Urdu-to-English      Newswire     24K words (72 documents)    -
Urdu-to-English      Web          21K words (166 documents)   -
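The character-to-word conversion is straightforward arithmetic; the sketch below shows one possible implementation, assuming the count is taken over CJK Unified Ideographs only (the exact counting rules used by NIST are not specified in this report).

    def chinese_word_count(text):
        # Approximate the Chinese "word" count at 1.5 characters per word.
        # Counting only CJK Unified Ideographs is an assumption; whitespace
        # and punctuation are ignored.
        chars = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
        return round(chars / 1.5)

    print(chinese_word_count("机器翻译评测"))  # 6 characters -> 4 words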

Performance Measurement

Translation quality was measured with the following automatic metrics:

  • BLEU-4 (mteval-v13a, the official MT09 evaluation metric)
  • IBM BLEU (bleu-1.04a)
  • NIST (mteval-v13a)
  • TER (tercom-0.7.25)
  • METEOR (meteor-0.7)
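For reference, the sketch below implements the core of the BLEU-4 computation: the geometric mean of modified n-gram precisions (n = 1..4) multiplied by a brevity penalty. It is a simplified sentence-level illustration under the original BLEU definition, not the official mteval-v13a scorer, which applies its own tokenization and normalization and aggregates statistics over the full test set.

    import math
    from collections import Counter

    def bleu4(candidate, references):
        # candidate: token list; references: list of token lists.
        # Assumes a non-empty candidate.
        def ngrams(toks, n):
            return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        log_prec = 0.0
        for n in range(1, 5):
            cand = ngrams(candidate, n)
            # Clip each n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for ref in references:
                for gram, cnt in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], cnt)
            clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
            if clipped == 0:
                return 0.0
            log_prec += 0.25 * math.log(clipped / sum(cand.values()))
        # Brevity penalty against the closest reference length
        # (original BLEU definition; mteval differs in details).
        c = len(candidate)
        r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
        bp = 1.0 if c >= r else math.exp(1.0 - r / c)
        return bp * math.exp(log_prec)

    cand = "the cat sat on the mat".split()
    refs = ["the cat sat on a mat".split(), "there is a cat on the mat".split()]
    print(round(bleu4(cand, refs), 4))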
Participants

The following table lists the organizations participating in MT09 and the test sets they registered to process.

Site ID | Organization | Location | Current: Arabic-to-English | Current: Urdu-to-English | Progress: Arabic-to-English | Progress: Chinese-to-English
("-" indicates the site did not register for that test set.)

afrl | Air Force Research Laboratory | USA | - | Yes | - | -
amsterdam | University of Amsterdam | Netherlands | Yes | Yes | Yes | Yes
apptek | AppTek | USA | Yes | - | Yes | Yes
bbn | BBN Technologies | USA | Yes | - | Yes | Yes
buaa | Beihang University, Institute of Intelligent Information Processing, School of Computer Science and Engineering | China | - | - | - | Yes
cas-ia | Chinese Academy of Sciences, Institute of Automation | China | - | - | - | Yes
cas-ict | Chinese Academy of Sciences, Institute of Computing Technology | China | - | - | - | Yes
ccid | China Center for Information Industry Development | China | - | - | - | Yes
cmu-ebmt | Carnegie Mellon EBMT | USA | - | Yes | - | -
cmu-smt | Carnegie Mellon LTI interACT | USA | Yes | - | Yes | Yes
cmu-statxfer | Carnegie Mellon StatXfer | USA | Yes | Yes | Yes | -
columbia | Columbia University | USA | Yes | - | - | -
cued | Cambridge University Engineering Department | UK | Yes | - | - | -
dcu | Dublin City University | Ireland | - | - | - | Yes
dfki | DFKI GmbH | Germany | - | - | - | Yes
edinburgh | University of Edinburgh | UK | Yes | - | Yes | withdrew
fbk | Fondazione Bruno Kessler | Italy | Yes | - | Yes | -
frdc | Fujitsu Research & Development Center Co., Ltd. | China | - | - | - | Yes
hit-ltrc | Harbin Institute of Technology, Language Technology Research Center | China | - | - | - | Yes
hongkong | City University of Hong Kong | China | withdrew | Yes | - | -
ibm | IBM | USA | Yes | - | withdrew | -
jhu | Johns Hopkins University | USA | Yes | Yes | - | withdrew
kcsl | KCSL Inc. | Canada | Yes | - | - | -
limsi | Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur - CNRS | France | Yes | - | - | -
lium | Université du Maine (Le Mans) | France | - | - | - | Yes
nju-nlp | Nanjing University NLP | China | - | - | - | Yes
nrc | National Research Council Canada | Canada | - | - | - | Yes
nthu | National Tsing Hua University, Department of Computer Science | Taiwan | - | - | - | Yes
rwth | RWTH-Aachen University, Chair of Computer Sciences | Germany | Yes | - | Yes | Yes
sakhr | Sakhr Software | Egypt | Yes | - | Yes | -
sri | SRI International | USA | Yes | - | Yes | Yes
stanford | Stanford University | USA | Yes | - | withdrew | -
systran | SYSTRAN Software Inc. | USA | - | Yes | - | -
telaviv | Tel Aviv University | Israel | Yes | - | - | -
tubitak-uekae | TUBITAK-UEKAE | Turkey | Yes | - | Yes | -
umd | University of Maryland | USA | Yes | Yes | withdrew | Yes
upc-lsi | UPC-LSI (Universitat Politècnica de Catalunya, Llenguatges i Sistemes Informàtics) | Spain | Yes | Yes | Yes | -
Total (Individual Only) | | | 21 | 9 | 12 | 19

Collaborations
isi-lw | University of Southern California / Language Weaver Inc. | USA | Yes | Yes | Yes | Yes
lium-systran | Université du Maine (Le Mans) / SYSTRAN | - | Yes | - | Yes | -
systran-lium | SYSTRAN / Université du Maine (Le Mans) | - | - | - | - | Yes
systran-nrc | SYSTRAN / National Research Council Canada | - | - | - | - | Yes
Total (Individual + Collaboration) | | | 23 | 10 | 14 | 22

Notes
fsc | Fitchburg State College | USA | Submission not scored.

Results

[ Current Test Set, Arabic-to-English Results ]   [ Current Test Set, Urdu-to-English Results ]   [ Progress Test Set Results ]