NIST Open Machine Translation (OpenMT) Evaluation
Information Technology Lab, Information Access Division NIST: National Institute of Standards and Technology


  • MT08
    Human Assessment Results

    Human assessments for OpenMT08 were implemented using a participant-volunteer model and were limited to one system submission per participant, which had to be their primary system entered in either the Constrained or Unconstrained training condition. Human assessments were offered for the Arabic-to-English, Chinese-to-English, and Urdu-to-English current tests. Two types of human assessment were done in OpenMT08:

    • Adequacy assessments:
      An assessor was presented with one reference translation and one system translation at a time. The assessor rated on a 7-point scale how adequate the MT output was by judging how much of the pertinent information was preserved. For segments that received one of the higher scores, the assessor then also gave a more global yes/no judgment as to whether the system translation meant essentially the same as the reference translation.
    • Preference assessments:
      An assessor viewed one reference translation and two system translations at a time and selected the MT output deemed to be the better translation given the reference translation.
    Documents included in the human assessments were selected to cover a range of attained BLEU scores. Each segment (or segment pair in the case of Preference judgments) was assessed by two independent judges.
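    As an illustration of how yes/no judgments from two independent judges might be tallied into a single percentage, here is a minimal sketch. The data layout, the function name `yes_percentage`, and the choice to pool all answers from both judges are assumptions made for this example; the page does not describe how the actual OpenMT08 percentages were computed.

```python
from typing import Dict, List


def yes_percentage(judgments: Dict[str, List[bool]]) -> float:
    """Return the percentage of "Yes" answers, pooled over all judges.

    `judgments` maps a segment id to the list of yes/no answers it
    received (True = "Yes"). Pooling across judges is an assumption
    of this sketch, not a documented OpenMT08 procedure.
    """
    answers = [a for per_segment in judgments.values() for a in per_segment]
    if not answers:
        raise ValueError("no judgments supplied")
    return 100.0 * sum(answers) / len(answers)


# Hypothetical toy data: each segment was judged by two assessors.
toy_judgments = {
    "seg1": [True, True],
    "seg2": [True, False],
    "seg3": [False, False],
    "seg4": [True, True],
}

print(yes_percentage(toy_judgments))
```

    Other aggregation choices are equally plausible, for example counting a segment as "Yes" only when both judges agree.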

    The human assessment results of OpenMT08 were not made available to the public, contrary to what was originally planned. Below are two examples of the kinds of scores attained:

    • Global Yes/No Adequacy task, Arabic-to-English test set, Newswire data (100 segments): The percentage of "Yes" answers assigned (indicating that the system translation meant essentially the same as the reference translation) ranged from 61% for the highest-scoring included system to 24% for the two lowest-scoring included systems.
    • Preference task, Arabic-to-English test set for Newswire data (50 segments, full pair-wise comparisons of twelve systems): The percentage of times that a system's translation was preferred over the other system's translation ranged from 52% for the highest-scoring included system to 12% for the lowest-scoring included system.
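    To make the Preference figures concrete, the sketch below enumerates the system pairs in a full pair-wise comparison and computes, per system, the percentage of comparisons it won. The tuple layout `(system_a, system_b, winner)` and the function names are hypothetical, and the computation is only one plausible reading of "percentage of times preferred"; the page does not specify the exact procedure.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def all_pairs(systems: List[str]) -> List[Tuple[str, str]]:
    """Every unordered pair of distinct systems (full pair-wise design)."""
    return list(combinations(systems, 2))


def preference_rates(
    judgments: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Percentage of comparisons each system won.

    Each judgment is (system_a, system_b, winner); this layout is an
    assumption of the sketch.
    """
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for a, b, winner in judgments:
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of the pair")
        appearances[a] += 1
        appearances[b] += 1
        wins[winner] += 1
    return {s: 100.0 * wins[s] / appearances[s] for s in appearances}


# Twelve systems compared pair-wise give 12 * 11 / 2 pairs per segment.
systems = [f"sys{i}" for i in range(12)]
print(len(all_pairs(systems)))

# Hypothetical toy judgments for three systems on one segment.
toy = [("A", "B", "A"), ("A", "C", "A"), ("B", "C", "C")]
print(preference_rates(toy))
```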

      Page Created: September 21, 2011
      Last Updated: September 21, 2011

    Multimodal Information Group is part of IAD and ITL
    NIST is an agency of the U.S. Department of Commerce