NIST is pleased to introduce the MetricsMATR Challenge, a new series of research challenge events for machine translation (MT) metrology, promoting the development of innovative, even revolutionary, MT metrics. MetricsMATR focuses entirely on MT metrics.


NIST has been conducting formal evaluations of machine translation (MT) technology since 2002. While these evaluations have been successful, there is still a need for a better understanding of exactly how useful the state-of-the-art technology is, and of how best to interpret the scores reported during evaluation.

This need exists primarily because of shortcomings in the current methods employed for the evaluation of machine translation technology:

  1. Automatic metrics have not yet proven able to consistently predict the usefulness, adequacy, and reliability of MT technologies.
  2. Automatic metrics have not been shown to be as meaningful for target languages other than English.
  3. Human assessments are expensive, slow, subjective, and difficult to standardize. Furthermore, they pertain only to the translations evaluated and are of no use even for updated translations from the same system.
  4. For both automatic metrics and human assessments, more insight is needed into which properties of a translation should be evaluated, as well as into how to evaluate those properties.
  5. Some MT systems evaluated incorporate algorithms that directly optimize scores on one or more MT metrics. These optimizations fail in the same respects that the metrics themselves fail.

These problems, and the need to overcome them through the development of improved automatic (and even semi-automatic) metrics, have been a constant point of discussion at past NIST MT evaluations. Without more appropriate metrics to address these shortcomings, the impact of formative and summative MT technology evaluations will remain limited.
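To make shortcoming (1) concrete, the following is a minimal sketch, in the spirit of BLEU-style modified n-gram precision, of how a typical automatic metric scores a hypothesis against a reference. It is an illustrative assumption, not any metric evaluated in MetricsMATR; the point is that surface n-gram overlap can remain high even when word order, and thus adequacy, is badly damaged.

```python
from collections import Counter

def ngram_precision(hypothesis: str, reference: str, n: int = 2) -> float:
    """Modified n-gram precision: the fraction of hypothesis n-grams that
    also occur in the reference, with counts clipped by the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
# A perfect translation scores 1.0 ...
print(ngram_precision("the cat sat on the mat", reference))  # 1.0
# ... but a scrambled, far less fluent hypothesis still scores 0.8.
print(ngram_precision("on the mat the cat sat", reference))  # 0.8
```

The second hypothesis preserves most local bigrams while destroying the sentence's meaning, illustrating why overlap-based scores cannot, on their own, predict usefulness or adequacy.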

NIST is running a new MT evaluation series, "MetricsMATR," designed to address this need for improved, even revolutionary, MT metrics.

More details regarding this evaluation can be found on the MetricsMATR home page.


The MetricsMATR evaluation set is composed of data from several different sources, covering multiple language pairs, data genres, and human assessment types.



This report covers 39 metrics, 7 of which are baseline metrics. More details are available on the metrics page.

Correlation Results

Our correlation analysis of the MetricsMATR 2008 metrics is ongoing and will continue to expand for quite some time. See the root node of our analysis for the current results.
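The core computation behind such an analysis is the correlation between a metric's scores and human assessments over a common set of translations. The sketch below, with hypothetical per-segment scores (the numbers are illustrative assumptions, not MetricsMATR data), computes the Pearson correlation coefficient that analyses of this kind typically report.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-segment scores: an automatic metric (0-1)
# versus a human adequacy judgment (1-7) for six segments.
metric_scores = [0.31, 0.45, 0.52, 0.60, 0.72, 0.88]
human_adequacy = [2, 3, 3, 5, 5, 7]
print(round(pearson(metric_scores, human_adequacy), 3))  # 0.967
```

A high segment-level correlation with human judgments is precisely what the improved metrics sought by MetricsMATR are expected to demonstrate; rank-based measures such as Spearman's rho are computed analogously on the ranks of the scores.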


References to this report should cite:


Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.

Questions and comments regarding these reports may be sent to