Date of Updated Release: Tuesday, November 1, 2006, version 4
The NIST 2006 Machine Translation Evaluation (MT-06) was part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state-of-the-art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-06 evaluation plan.
Disclaimer
These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-06 was an evaluation of research algorithms, the MT-06 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.
There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.
The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.
For the above reasons, this should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions regarding which commercial products are best for a particular application.
The MT-06 evaluation consisted of two tasks. Each task required a system to perform translation from a given source language into the target language. The source languages were Arabic and Chinese, and the target language was English.
MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely by the amount of data that was available for use in system training and development. The evaluation conditions were called "Large Data Track" and "Unlimited Data Track".
and an unofficial track added late to the evaluation to support issues with using non-publicly available data:
Other submissions not in categories described above are not reported here.
Source Data
In an effort to reduce data creation costs, the MT-06 evaluation made use of GALE-06 evaluation data (GALE subset). NIST augmented the GALE subset with additional data of equal or greater size for most of the genres (NIST subset). This provided a larger and more diverse test set. Each set contained documents drawn from newswire text documents, web-based newsgroup documents, human transcription of broadcast news, and human transcription of broadcast conversations. The source documents were encoded in UTF-8.
The test data was selected from a pool of data collected by the LDC during February 2006. The careful selection process sought to have a variety of sources (see below), publication dates, and difficulty ratings while hitting the target test set size.
| Genre | Arabic Sources | Arabic Target Size (num of reference words) | Chinese Sources | Chinese Target Size (num of reference words) |
| Newswire | Agence France Presse, Assabah, Xinhua News Agency | 30K | Agence France Presse, Xinhua News Agency | 30K |
| Newsgroup | Google's groups, Yahoo's groups | 20K | Google's groups | 20K |
| Broadcast News | Dubai TV, Al Jazeera, Lebanese Broadcast Corporation | 20K | Central China TV, New Tang Dynasty TV, Phoenix TV | 20K |
| Broadcast Conversation | Dubai TV, Al Jazeera, Lebanese Broadcast Corporation | 10K | Central China TV, New Tang Dynasty TV, Phoenix TV | 10K |
Reference Data
The GALE subset had one adjudicated high-quality translation that was produced by the National Virtual Translation Center. The NIST subset had four independently generated high-quality translations that were produced by professional translation companies. In both subsets, each translation agency was required to have native speaker(s) of the source and target languages working on the translations.
Machine translation quality was measured automatically using an N-gram co-occurrence statistic metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that a system translation shares with one or more high-quality reference translations. Thus, the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from "0" to "1" with "1" being the best possible score. A detailed description of BLEU can be found in the paper Papineni, Roukos, Ward, Zhu (2001). "Bleu: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
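The N-gram co-occurrence idea can be illustrated with a minimal sketch of corpus-level BLEU-4. This is not the official MT-06 scoring tool: the function names `ngrams` and `corpus_bleu` are hypothetical, the input is assumed to be already tokenized, no smoothing is applied, and the brevity penalty uses the reference closest in length to each hypothesis, which may differ from the choice made in the official scoring.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (illustrative sketch).

    hypotheses: list of token lists, one per segment.
    references: list of lists of token lists (one or more references per segment).
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams produced by the system, per n-gram order
    hyp_len = 0
    ref_len = 0

    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Assumption: use the reference closest in length (shorter one on ties).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram by its maximum count in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    if min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


hyp = [["a", "b", "c", "d", "e"]]
refs = [[["a", "b", "c", "d", "f"]]]
score = corpus_bleu(hyp, refs)  # geometric mean of 4/5, 3/4, 2/3, 1/2
```

Because the score is a geometric mean of the 1-gram through 4-gram precisions, a system output that shares no 4-gram with any reference scores zero in this unsmoothed form, however many shorter matches it has.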
Although BLEU was the official metric for MT-06, measuring translation quality is an ongoing research topic in the MT community. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance. Three additional automatic metrics (METEOR, TER, and BLEU-refinement), as well as human assessment, were used to report the system performance. As stated in the evaluation specification document, this official public version of the results reports only the scores as measured by BLEU.
The tables below list the organizations involved in submitting MT-06 evaluation results. Most submitted results representing their own organizations; some participated only in a collaborative effort (marked by the @ symbol); and some did both (marked by the + symbol).
| Site ID | Organization | Location |
| apptek | Applications Technology Inc. | USA |
| arl | Army Research Laboratory+ | USA |
| auc | The American University in Cairo | Egypt |
| bbn | BBN Technologies | USA |
| cu | Cambridge University@ | UK |
| cmu | Carnegie Mellon University@ | USA |
| casia | Institute of Automation Chinese Academy of Sciences | China |
| columbia | Columbia University | USA |
| dcu | Dublin City University | Ireland |
| google | Google | USA |
| hkust | Hong Kong University of Science and Technology | China |
| ibm | IBM | USA |
| ict | Institute of Computing Technology Chinese Academy of Sciences | China |
| iscas | Institute of Software Chinese Academy of Sciences | China |
| isi | Information Sciences Institute+ | USA |
| itcirst | ITC-irst | Italy |
| jhu | Johns Hopkins University@ | USA |
| ksu | Kansas State University | USA |
| kcsl | KCSL Inc. | Canada |
| lw | Language Weaver | USA |
| lcc | Language Computer | USA |
| lingua | Lingua Technologies Inc. | Canada |
| msr | Microsoft Research | USA |
| mit | MIT@ | USA |
| nict | National Institute of Information and Communications Technology | Japan |
| nlmp | National Laboratory on Machine Perception Peking University | China |
| ntt | NTT Communication Science Laboratories | Japan |
| nrc | National Research Council Canada+ | Canada |
| qmul | Queen Mary University of London | England |
| rwth | RWTH Aachen University+ | Germany |
| sakhr | Sakhr Software Co. | USA |
| sri | SRI International | USA |
| ucb | University of California Berkeley | USA |
| edinburgh | University of Edinburgh+ | Scotland |
| uka | University of Karlsruhe@ | Germany |
| umd | University of Maryland@ | USA |
| upenn | University of Pennsylvania | USA |
| upc | Universitat Politecnica de Catalunya | Spain |
| uw | University of Washington@ | USA |
| xmu | Xiamen University | China |
| Site ID | Team/Collaboration | Location |
| arl-cmu | Army Research Laboratory & Carnegie Mellon University | USA |
| cmu-uka | Carnegie Mellon University & University of Karlsruhe | USA, Germany |
| edinburgh-mit | University of Edinburgh & MIT | Scotland, USA |
| isi-cu | Information Sciences Institute & Cambridge University | USA, England |
| rwth-sri-nrc-uw | RWTH Aachen University, SRI International, National Research Council Canada, University of Washington | Germany, USA, Canada, USA |
| umd-jhu | University of Maryland & Johns Hopkins University | USA |
Each site/team could submit one or more systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports the results only for the primary systems.
The tables below list the results of the NIST 2006 Machine Translation Evaluation. The results are sorted by the BLEU scores and reported separately for the GALE subset and the NIST subset because they do not have the same number of reference translations. The results are also reported for each data domain. Note that these scores reflect case-errors.
Friedman's Rank Test for k Correlated Samples was used to test for significant differences among the systems. The initial null hypothesis was that all systems were the same. If the null hypothesis was rejected at the 95% level of confidence, the lowest scoring system was taken out of the pool of systems to be tested, and Friedman's Rank Test was repeated for the remaining systems until no significant difference was found. The systems that remained in the pool were deemed to be statistically equivalent. The process was then repeated for the systems taken out of the pool. Alternating colors (white and yellow backgrounds) show the different groups.
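The iterative grouping procedure described above can be sketched as follows. This is an illustrative reading of the text, not NIST's implementation: the function names are hypothetical, the repeated samples ("blocks") are assumed to be per-document scores because the text does not say what was sampled, no tie correction is applied to the statistic, and the chi-square critical values at the 0.05 level are hard-coded for up to 10 degrees of freedom only.

```python
# Chi-square critical values at alpha = 0.05, indexed by degrees of freedom.
CHI2_CRIT_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
                6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}


def average_ranks(values):
    """Rank values ascending (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square statistic for a dict: system -> list of block scores."""
    systems = list(scores)
    k = len(systems)
    n = len(scores[systems[0]])
    rank_sums = [0.0] * k
    for b in range(n):
        block = [scores[s][b] for s in systems]
        for idx, rank in enumerate(average_ranks(block)):
            rank_sums[idx] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def group_systems(scores, overall):
    """Split systems into groups with no significant difference within a group.

    overall: dict system -> overall score, used to pick the lowest-scoring system.
    """
    pool = sorted(scores, key=lambda s: overall[s], reverse=True)
    groups = []
    while pool:
        current, removed = list(pool), []
        while len(current) > 1:
            stat = friedman_statistic({s: scores[s] for s in current})
            if stat <= CHI2_CRIT_95[len(current) - 1]:
                break  # null hypothesis not rejected: these systems form a group
            removed.insert(0, current.pop())  # take out the lowest-scoring system
        groups.append(current)
        pool = removed  # repeat the process for the systems taken out
    return groups


scores = {
    "A": [0.50, 0.60, 0.55, 0.52, 0.58, 0.61],
    "B": [0.60, 0.50, 0.56, 0.51, 0.59, 0.60],
    "C": [0.10, 0.20, 0.15, 0.12, 0.18, 0.11],
}
overall = {"A": 0.57, "B": 0.56, "C": 0.14}
groups = group_systems(scores, overall)
```

In this toy input, systems A and B trade places from block to block while C is always last, so the first test rejects the null hypothesis, C is removed, and A and B end up in one group. With SciPy available, `scipy.stats.friedmanchisquare` could replace the hand-written statistic and the critical-value table.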
Key:
Note: Site 'nlmp' was unable to process the entire test set. No result is listed for that site.
Arabic-to-English Results
Large Data Track
NIST Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| | 0.4281 |
| ibm | 0.3954 |
| isi | 0.3908 |
| rwth | 0.3906 |
| apptek*# | 0.3874 |
| lw | 0.3741 |
| bbn | 0.3690 |
| ntt | 0.3680 |
| itcirst | 0.3466 |
| cmu-uka | 0.3369 |
| umd-jhu | 0.3333 |
| edinburgh*# | 0.3303 |
| sakhr | 0.3296 |
| nict | 0.2930 |
| qmul | 0.2896 |
| lcc | 0.2778 |
| upc | 0.2741 |
| columbia | 0.2465 |
| ucb | 0.1978 |
| auc | 0.1531 |
| dcu | 0.0947 |
| kcsl*# | 0.0522 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.4814 |
| ibm | 0.4542 |
| rwth | 0.4441 |
| isi | 0.4426 |
| lw | 0.4368 |
| bbn | 0.4254 |
| apptek*# | 0.4212 |
| ntt | 0.4035 |
| umd-jhu | 0.3997 |
| edinburgh*# | 0.3945 |
| cmu-uka | 0.3943 |
| itcirst | 0.3798 |
| qmul | 0.3737 |
| sakhr | 0.3736 |
| nict | 0.3568 |
| lcc | 0.3089 |
| upc | 0.3049 |
| columbia | 0.2759 |
| ucb | 0.2369 |
| auc | 0.1750 |
| dcu | 0.0875 |
| kcsl*# | 0.0423 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| apptek*# | 0.3311 |
| | 0.3225 |
| ntt | 0.2973 |
| isi | 0.2895 |
| ibm | 0.2774 |
| bbn | 0.2771 |
| rwth | 0.2726 |
| itcirst | 0.2696 |
| sakhr | 0.2634 |
| lw | 0.2503 |
| cmu | 0.2436 |
| edinburgh*# | 0.2208 |
| lcc | 0.2135 |
| columbia | 0.2111 |
| umd-jhu | 0.2059 |
| nict | 0.1875 |
| upc | 0.1842 |
| ucb | 0.1690 |
| dcu | 0.1177 |
| qmul | 0.1116 |
| auc | 0.1099 |
| kcsl*# | 0.0770 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| | 0.3781 |
| apptek*# | 0.3729 |
| lw | 0.3646 |
| isi | 0.3630 |
| ibm | 0.3612 |
| rwth | 0.3511 |
| ntt | 0.3324 |
| bbn | 0.3302 |
| umd-jhu | 0.3148 |
| itcirst | 0.3128 |
| edinburgh*# | 0.2925 |
| cmu | 0.2874 |
| sakhr | 0.2814 |
| qmul | 0.2768 |
| upc | 0.2463 |
| nict | 0.2458 |
| lcc | 0.2445 |
| columbia | 0.2054 |
| auc | 0.1419 |
| ucb | 0.1114 |
| dcu | 0.0594 |
| kcsl*# | 0.0326 |
GALE Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| apptek*# | 0.1918 |
| | 0.1826 |
| isi | 0.1714 |
| ibm | 0.1674 |
| sakhr | 0.1648 |
| rwth | 0.1639 |
| lw | 0.1594 |
| ntt | 0.1533 |
| itcirst | 0.1475 |
| bbn | 0.1461 |
| cmu | 0.1392 |
| umd-jhu | 0.1370 |
| qmul | 0.1345 |
| edinburgh*# | 0.1305 |
| nict | 0.1192 |
| upc | 0.1149 |
| lcc | 0.1129 |
| columbia | 0.0960 |
| ucb | 0.0732 |
| auc | 0.0635 |
| dcu | 0.0320 |
| kcsl*# | 0.0176 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.2647 |
| ibm | 0.2432 |
| isi | 0.2300 |
| rwth | 0.2263 |
| apptek*# | 0.2225 |
| sakhr | 0.2196 |
| lw | 0.2193 |
| ntt | 0.2180 |
| bbn | 0.2170 |
| itcirst | 0.2104 |
| umd-jhu | 0.2084 |
| cmu | 0.2055 |
| edinburgh*# | 0.2052 |
| qmul | 0.1984 |
| nict | 0.1773 |
| lcc | 0.1648 |
| upc | 0.1575 |
| columbia | 0.1438 |
| ucb | 0.1299 |
| auc | 0.0937 |
| dcu | 0.0466 |
| kcsl*# | 0.0182 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| apptek*# | 0.1747 |
| sakhr | 0.1331 |
| | 0.1130 |
| ibm | 0.1060 |
| rwth | 0.1017 |
| isi | 0.0918 |
| ntt | 0.0906 |
| lw | 0.0853 |
| cmu | 0.0840 |
| bbn | 0.0837 |
| itcirst | 0.0821 |
| qmul | 0.0818 |
| umd-jhu | 0.0754 |
| edinburgh*# | 0.0681 |
| lcc | 0.0643 |
| nict | 0.0639 |
| columbia | 0.0634 |
| upc | 0.0603 |
| ucb | 0.0411 |
| auc | 0.0326 |
| dcu | 0.0254 |
| kcsl*# | 0.0089 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| apptek*# | 0.1944 |
| isi | 0.1766 |
| | 0.1721 |
| lw | 0.1649 |
| rwth | 0.1599 |
| ibm | 0.1588 |
| sakhr | 0.1495 |
| itcirst | 0.1471 |
| ntt | 0.1469 |
| bbn | 0.1391 |
| cmu | 0.1362 |
| umd-jhu | 0.1309 |
| qmul | 0.1266 |
| edinburgh*# | 0.1240 |
| nict | 0.1152 |
| upc | 0.1150 |
| lcc | 0.1016 |
| columbia | 0.0879 |
| auc | 0.0619 |
| ucb | 0.0412 |
| dcu | 0.0252 |
| kcsl*# | 0.0229 |
Broadcast Conversation BLEU Scores
| Site ID | BLEU-4 |
| isi | 0.1756 |
| apptek*# | 0.1747 |
| | 0.1745 |
| rwth | 0.1615 |
| lw | 0.1582 |
| ibm | 0.1563 |
| ntt | 0.1512 |
| sakhr | 0.1446 |
| itcirst | 0.1425 |
| bbn | 0.1400 |
| umd-jhu | 0.1277 |
| qmul | 0.1265 |
| cmu | 0.1261 |
| edinburgh*# | 0.1203 |
| upc | 0.1200 |
| lcc | 0.1157 |
| nict | 0.1156 |
| columbia | 0.0866 |
| ucb | 0.0783 |
| auc | 0.0620 |
| dcu | 0.0306 |
| kcsl*# | 0.0183 |
Unlimited Data Track
NIST Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| | 0.4535 |
| lw | 0.4008 |
| rwth | 0.3970 |
| rwth+sri+nrc+uw* | 0.3966 |
| nrc | 0.3750 |
| sri | 0.3743 |
| edinburgh*# | 0.3449 |
| cmu | 0.3376 |
| arl-cmu | 0.1424 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.5034 |
| lw | 0.4589 |
| rwth+sri+nrc+uw* | 0.4493 |
| rwth | 0.4458 |
| nrc | 0.4300 |
| sri | 0.4240 |
| edinburgh*# | 0.4133 |
| cmu | 0.3974 |
| arl-cmu | 0.1402 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| | 0.3652 |
| lw | 0.2851 |
| rwth | 0.2829 |
| nrc | 0.2799 |
| rwth+sri+nrc+uw* | 0.2755 |
| sri | 0.2534 |
| cmu | 0.2372 |
| edinburgh*# | 0.2287 |
| arl-cmu | 0.1485 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| | 0.4018 |
| lw | 0.3685 |
| rwth | 0.3662 |
| rwth+sri+nrc+uw* | 0.3639 |
| sri | 0.3326 |
| nrc | 0.3312 |
| edinburgh*# | 0.3049 |
| cmu | 0.2988 |
| arl-cmu | 0.1363 |
GALE Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| | 0.1957 |
| lw | 0.1721 |
| rwth+sri+nrc+uw* | 0.1710 |
| rwth | 0.1680 |
| sri | 0.1614 |
| nrc | 0.1517 |
| cmu | 0.1382 |
| edinburgh*# | 0.1365 |
| arl-cmu | 0.0736 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.2812 |
| lw | 0.2294 |
| rwth+sri+nrc+uw* | 0.2289 |
| rwth | 0.2258 |
| nrc | 0.2172 |
| sri | 0.2081 |
| edinburgh*# | 0.2068 |
| cmu | 0.2006 |
| arl-cmu | 0.0858 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| | 0.1267 |
| rwth | 0.1133 |
| rwth+sri+nrc+uw* | 0.1078 |
| lw | 0.1007 |
| nrc | 0.1007 |
| sri | 0.0953 |
| cmu | 0.0894 |
| edinburgh*# | 0.0722 |
| arl-cmu | 0.0558 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| | 0.1868 |
| rwth+sri+nrc+uw* | 0.1730 |
| lw | 0.1715 |
| sri | 0.1661 |
| rwth | 0.1625 |
| nrc | 0.1415 |
| edinburgh*# | 0.1293 |
| cmu | 0.1276 |
| arl-cmu | 0.0855 |
Broadcast Conversation BLEU Scores
| Site ID | BLEU-4 |
| | 0.1824 |
| lw | 0.1756 |
| rwth+sri+nrc+uw* | 0.1676 |
| sri | 0.1671 |
| rwth | 0.1658 |
| nrc | 0.1429 |
| edinburgh*# | 0.1341 |
| cmu | 0.1322 |
| arl-cmu | 0.0584 |
Chinese-to-English Results
Large Data Track
NIST Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| isi | 0.3393 |
| | 0.3316 |
| lw | 0.3278 |
| rwth | 0.3022 |
| ict | 0.2913 |
| edinburgh*# | 0.2830 |
| bbn | 0.2781 |
| nrc | 0.2762 |
| itcirst | 0.2749 |
| umd-jhu | 0.2704 |
| ntt | 0.2595 |
| nict | 0.2449 |
| cmu | 0.2348 |
| msr | 0.2314 |
| qmul | 0.2276 |
| hkust | 0.2080 |
| upc | 0.2071 |
| upenn | 0.1958 |
| iscas | 0.1816 |
| lcc | 0.1814 |
| xmu | 0.1580 |
| lingua* | 0.1341 |
| kcsl*# | 0.0512 |
| ksu | 0.0401 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| isi | 0.3486 |
| | 0.3470 |
| lw | 0.3404 |
| ict | 0.3085 |
| rwth | 0.3022 |
| nrc | 0.2867 |
| umd-jhu | 0.2863 |
| edinburgh*# | 0.2776 |
| bbn | 0.2774 |
| itcirst | 0.2739 |
| ntt | 0.2656 |
| nict | 0.2509 |
| cmu | 0.2496 |
| msr | 0.2387 |
| qmul | 0.2299 |
| upenn | 0.2064 |
| upc | 0.2057 |
| hkust | 0.1999 |
| lcc | 0.1721 |
| iscas | 0.1715 |
| xmu | 0.1619 |
| lingua* | 0.1412 |
| kcsl*# | 0.0510 |
| ksu | 0.0380 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| | 0.2620 |
| isi | 0.2571 |
| lw | 0.2454 |
| edinburgh*# | 0.2434 |
| rwth | 0.2417 |
| nrc | 0.2330 |
| ict | 0.2325 |
| bbn | 0.2275 |
| itcirst | 0.2264 |
| umd-jhu | 0.2061 |
| ntt | 0.2036 |
| nict | 0.2006 |
| msr | 0.1878 |
| cmu | 0.1865 |
| hkust | 0.1851 |
| qmul | 0.1840 |
| iscas | 0.1681 |
| upenn | 0.1665 |
| lcc | 0.1634 |
| upc | 0.1619 |
| xmu | 0.1406 |
| lingua* | 0.1207 |
| kcsl*# | 0.0531 |
| ksu | 0.0361 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| rwth | 0.3501 |
| | 0.3481 |
| isi | 0.3463 |
| lw | 0.3327 |
| bbn | 0.3197 |
| edinburgh*# | 0.3172 |
| itcirst | 0.3128 |
| ict | 0.2977 |
| ntt | 0.2928 |
| umd-jhu | 0.2928 |
| nrc | 0.2914 |
| qmul | 0.2571 |
| nict | 0.2568 |
| msr | 0.2527 |
| cmu | 0.2468 |
| upc | 0.2403 |
| hkust | 0.2376 |
| iscas | 0.2090 |
| lcc | 0.2046 |
| upenn | 0.2008 |
| xmu | 0.1652 |
| lingua* | 0.1323 |
| kcsl*# | 0.0475 |
| ksu | 0.0464 |
GALE Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| | 0.1470 |
| isi | 0.1413 |
| lw | 0.1299 |
| edinburgh*# | 0.1199 |
| itcirst | 0.1194 |
| nrc | 0.1194 |
| rwth | 0.1187 |
| ict | 0.1185 |
| bbn | 0.1165 |
| umd-jhu | 0.1140 |
| cmu | 0.1135 |
| ntt | 0.1116 |
| nict | 0.1106 |
| hkust | 0.0984 |
| msr | 0.0972 |
| qmul | 0.0943 |
| upc | 0.0931 |
| upenn | 0.0923 |
| iscas | 0.0860 |
| lcc | 0.0813 |
| xmu | 0.0747 |
| lingua* | 0.0663 |
| ksu | 0.0218 |
| kcsl*# | 0.0199 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.1905 |
| isi | 0.1685 |
| lw | 0.1596 |
| ict | 0.1515 |
| edinburgh*# | 0.1467 |
| rwth | 0.1448 |
| bbn | 0.1433 |
| umd-jhu | 0.1419 |
| nrc | 0.1404 |
| itcirst | 0.1377 |
| cmu | 0.1353 |
| ntt | 0.1350 |
| msr | 0.1280 |
| hkust | 0.1161 |
| nict | 0.1155 |
| qmul | 0.1102 |
| upenn | 0.1068 |
| upc | 0.1039 |
| iscas | 0.0947 |
| lcc | 0.0878 |
| xmu | 0.0861 |
| lingua* | 0.0657 |
| kcsl*# | 0.0178 |
| ksu | 0.0138 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| | 0.1365 |
| isi | 0.1235 |
| edinburgh*# | 0.1140 |
| lw | 0.1137 |
| ict | 0.1130 |
| itcirst | 0.1108 |
| nrc | 0.1098 |
| nict | 0.1075 |
| rwth | 0.1071 |
| cmu | 0.1054 |
| bbn | 0.1049 |
| ntt | 0.1026 |
| umd-jhu | 0.0978 |
| upenn | 0.0941 |
| hkust | 0.0892 |
| qmul | 0.0858 |
| upc | 0.0851 |
| msr | 0.0841 |
| lcc | 0.0765 |
| iscas | 0.0745 |
| lingua* | 0.0687 |
| xmu | 0.0681 |
| ksu | 0.0249 |
| kcsl*# | 0.0177 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| isi | 0.1441 |
| | 0.1409 |
| lw | 0.1343 |
| rwth | 0.1231 |
| itcirst | 0.1193 |
| nrc | 0.1192 |
| cmu | 0.1159 |
| bbn | 0.1146 |
| ict | 0.1146 |
| edinburgh*# | 0.1110 |
| ntt | 0.1096 |
| nict | 0.1090 |
| umd-jhu | 0.1084 |
| hkust | 0.1005 |
| upc | 0.0986 |
| qmul | 0.0951 |
| msr | 0.0922 |
| iscas | 0.0891 |
| upenn | 0.0882 |
| lcc | 0.0814 |
| xmu | 0.0705 |
| lingua* | 0.0609 |
| kcsl*# | 0.0204 |
| ksu | 0.0192 |
Broadcast Conversation BLEU Scores
| Site ID | BLEU-4 |
| isi | 0.1280 |
| | 0.1262 |
| edinburgh*# | 0.1119 |
| lw | 0.1112 |
| itcirst | 0.1106 |
| nict | 0.1106 |
| umd-jhu | 0.1102 |
| nrc | 0.1095 |
| bbn | 0.1060 |
| ntt | 0.1016 |
| rwth | 0.1013 |
| ict | 0.0990 |
| cmu | 0.0973 |
| hkust | 0.0891 |
| msr | 0.0873 |
| qmul | 0.0870 |
| upc | 0.0848 |
| iscas | 0.0842 |
| upenn | 0.0815 |
| lcc | 0.0796 |
| xmu | 0.0753 |
| lingua* | 0.0700 |
| ksu | 0.0270 |
| kcsl*# | 0.0223 |
Unlimited Data Track
NIST Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| | 0.3496 |
| rwth | 0.2975 |
| edinburgh*# | 0.2843 |
| cmu | 0.2449 |
| casia | 0.1894 |
| xmu | 0.1713 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.3634 |
| rwth | 0.2974 |
| edinburgh*# | 0.2852 |
| cmu | 0.2430 |
| casia | 0.1905 |
| xmu | 0.1696 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| | 0.2870 |
| edinburgh*# | 0.2450 |
| rwth | 0.2307 |
| cmu | 0.2004 |
| casia | 0.1709 |
| xmu | 0.1618 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| | 0.3649 |
| rwth | 0.3509 |
| edinburgh*# | 0.3142 |
| cmu | 0.2644 |
| casia | 0.1889 |
| xmu | 0.1818 |
GALE Subset
Overall BLEU Scores
| Site ID | BLEU-4 |
| | 0.1526 |
| edinburgh*# | 0.1187 |
| rwth | 0.1172 |
| cmu | 0.1034 |
| casia | 0.0900 |
| xmu | 0.0793 |
Newswire BLEU Scores
| Site ID | BLEU-4 |
| | 0.2057 |
| edinburgh*# | 0.1465 |
| rwth | 0.1436 |
| cmu | 0.1158 |
| casia | 0.1001 |
| xmu | 0.0817 |
Newsgroup BLEU Scores
| Site ID | BLEU-4 |
| | 0.1432 |
| edinburgh*# | 0.1070 |
| rwth | 0.1032 |
| cmu | 0.1015 |
| casia | 0.0916 |
| xmu | 0.0782 |
Broadcast News BLEU Scores
| Site ID | BLEU-4 |
| | 0.1482 |
| rwth | 0.1224 |
| edinburgh*# | 0.1090 |
| cmu | 0.1020 |
| casia | 0.0891 |
| xmu | 0.0775 |
Broadcast Conversation BLEU Scores
| Site ID | BLEU-4 |
| | 0.1206 |
| edinburgh*# | 0.1157 |
| rwth | 0.1010 |
| cmu | 0.0957 |
| casia | 0.0812 |
| xmu | 0.0801 |
NIST data set (BLEU-4)
| Site ID | Language | Overall | Newswire | Newsgroup | Broadcast News |
| | Arabic | 0.4569 | 0.5060 | 0.3727 | 0.4076 |
| | Chinese | 0.3615 | 0.3725 | 0.2926 | 0.3859 |
GALE data set (BLEU-4)
| Site ID | Language | Overall | Newswire | Newsgroup | Broadcast News | Broadcast Conversation |
| | Arabic | 0.2024 | 0.2820 | 0.1359 | 0.1932 | 0.1925 |
| | Chinese | 0.1576 | 0.2086 | 0.1454 | 0.1532 | 0.1300 |