NIST 2006 Machine Translation Evaluation Official Results

Date of Updated Release: Tuesday, November 1, 2006, version 4

The NIST 2006 Machine Translation Evaluation (MT-06) was part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations to support machine translation (MT) research and to help advance the state of the art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-06 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Because MT-06 was an evaluation of research algorithms, the MT-06 test design required local implementation by each participant. Participants were therefore required only to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes in the task protocols could reveal different performance strengths and weaknesses for the same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best suited to a particular application.

Evaluation Tasks

The MT-06 evaluation consisted of two tasks. Each task required a system to perform translation from a given source language into the target language. The source languages were Arabic and Chinese, and the target language was English.

Evaluation Conditions

MT research and development requires language data resources, and system performance is strongly affected by the type and amount of resources used. Therefore, two resource categories were defined as conditions of evaluation. The categories differed solely in the amount of data available for use in system training and development. The evaluation conditions were called the "Large Data Track" and the "Unlimited Data Track".

Submissions that do not fall into the categories described above are not reported here.

Evaluation Data

Source Data

To reduce data creation costs, the MT-06 evaluation reused the GALE-06 evaluation data (the GALE subset). NIST augmented the GALE subset with additional data of equal or greater size for most of the genres (the NIST subset), providing a larger and more diverse test set. Each set contained newswire text documents, web-based newsgroup documents, human transcriptions of broadcast news, and human transcriptions of broadcast conversations. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during February 2006. The selection process sought a variety of sources (see below), publication dates, and difficulty ratings while meeting the target test set size.

Genre: Newswire
  Arabic sources: Agence France Presse, Assabah, Xinhua News Agency (target size: 30K reference words)
  Chinese sources: Agence France Presse, Xinhua News Agency (target size: 30K reference words)

Genre: Newsgroup
  Arabic sources: Google's groups, Yahoo's groups (target size: 20K reference words)
  Chinese sources: Google's groups (target size: 20K reference words)

Genre: Broadcast News
  Arabic sources: Dubai TV, Al Jazeera, Lebanese Broadcast Corporation (target size: 20K reference words)
  Chinese sources: Central China TV, New Tang Dynasty TV, Phoenix TV (target size: 20K reference words)

Genre: Broadcast Conversation
  Arabic sources: Dubai TV, Al Jazeera, Lebanese Broadcast Corporation (target size: 10K reference words)
  Chinese sources: Central China TV, New Tang Dynasty TV, Phoenix TV (target size: 10K reference words)

Reference Data

The GALE subset had one adjudicated, high-quality translation produced by the National Virtual Translation Center. The NIST subset had four independently generated, high-quality translations produced by professional translation companies. In both subsets, each translation agency was required to have native speakers of the source and target languages working on the translations.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence statistic developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that a system translation shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (IBM Research Report RC22176).
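
The official scores in this report were produced with NIST's scoring tools. Purely as an illustration of the idea behind the metric, the sketch below computes corpus-level BLEU with the open-source sacrebleu package on two made-up segments; the example sentences and the choice of package are assumptions for illustration, not part of the evaluation.

```python
# Illustrative only: corpus-level BLEU with sacrebleu (not NIST's official
# mteval scoring pipeline). Requires: pip install sacrebleu
import sacrebleu

# Hypothetical system outputs, one string per segment.
hypotheses = [
    "the cat sat on the mat",
    "there is a book on the table",
]

# Two independent reference translations; each inner list is one complete
# reference stream aligned with the hypotheses.
references = [
    ["the cat sat on the mat", "there is a book on the table"],
    ["a cat was sitting on the mat", "a book lies on the table"],
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")  # sacrebleu reports BLEU on a 0-100 scale
```

Dividing sacrebleu's 0-100 score by 100 puts it on the 0-1 scale used in this report.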

Although BLEU was the official metric for MT-06, measuring translation quality is an ongoing research topic in the MT community, and at present no single metric has been deemed completely indicative of all aspects of system performance. Three additional automatic metrics (METEOR, TER, and a BLEU refinement), as well as human assessment, were also used to measure system performance. As stated in the evaluation specification document, this official public version of the results reports only the scores measured by BLEU.

Evaluation Participants

The tables below list the organizations involved in submitting MT-06 evaluation results. Most submitted results representing their own organization, some participated only in a collaborative effort (marked with the @ symbol), and some did both (marked with the + symbol).

Site ID  Organization  Location
apptek  Applications Technology Inc.  USA
arl  Army Research Laboratory+  USA
auc  The American University in Cairo  Egypt
bbn  BBN Technologies  USA
cu  Cambridge University@  UK
cmu  Carnegie Mellon University@  USA
casia  Institute of Automation, Chinese Academy of Sciences  China
columbia  Columbia University  USA
dcu  Dublin City University  Ireland
google  Google  USA
hkust  Hong Kong University of Science and Technology  China
ibm  IBM  USA
ict  Institute of Computing Technology, Chinese Academy of Sciences  China
iscas  Institute of Software, Chinese Academy of Sciences  China
isi  Information Sciences Institute+  USA
itcirst  ITC-irst  Italy
jhu  Johns Hopkins University@  USA
ksu  Kansas State University  USA
kcsl  KCSL Inc.  Canada
lw  Language Weaver  USA
lcc  Language Computer  USA
lingua  Lingua Technologies Inc.  Canada
msr  Microsoft Research  USA
mit  MIT@  USA
nict  National Institute of Information and Communications Technology  Japan
nlmp  National Laboratory on Machine Perception, Peking University  China
ntt  NTT Communication Science Laboratories  Japan
nrc  National Research Council Canada+  Canada
qmul  Queen Mary University of London  England
rwth  RWTH Aachen University+  Germany
sakhr  Sakhr Software Co.  USA
sri  SRI International  USA
ucb  University of California, Berkeley  USA
edinburgh  University of Edinburgh+  Scotland
uka  University of Karlsruhe@  Germany
umd  University of Maryland@  USA
upenn  University of Pennsylvania  USA
upc  Universitat Politecnica de Catalunya  Spain
uw  University of Washington@  USA
xmu  Xiamen University  China

Site ID  Team/Collaboration  Location
arl-cmu  Army Research Laboratory & Carnegie Mellon University  USA
cmu-uka  Carnegie Mellon University & University of Karlsruhe  USA, Germany
edinburgh-mit  University of Edinburgh & MIT  Scotland, USA
isi-cu  Information Sciences Institute & Cambridge University  USA, England
rwth-sri-nrc-uw  RWTH Aachen University, SRI International, National Research Council Canada, & University of Washington  Germany, USA, Canada, USA
umd-jhu  University of Maryland & Johns Hopkins University  USA

Evaluation Systems

Each site or team could submit one or more systems for evaluation, with one system designated as its primary system. The primary system represented the site's or team's best effort. This official public version of the results reports results only for the primary systems.

Evaluation Results

The tables below list the results of the NIST 2006 Machine Translation Evaluation. The results are sorted by BLEU score and are reported separately for the GALE subset and the NIST subset, because the two subsets do not have the same number of reference translations. Results are also reported for each data domain. Note that scoring was case-sensitive, so these scores reflect case errors.

Friedman's Rank Test for k correlated samples was used to test for significant differences among the systems. The initial null hypothesis was that all systems were the same. If the null hypothesis was rejected at the 95% confidence level, the lowest-scoring system was removed from the pool of systems being tested, and the test was repeated on the remaining systems until no significant difference was found. The systems remaining in the pool were deemed statistically equivalent. The process was then repeated for the systems that had been removed from the pool. Alternating colors (white and yellow backgrounds) show the different groups.
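
A minimal sketch of this grouping procedure is shown below, assuming each system has a score for every document in a common test set; the scoring unit and the helper name group_equivalent_systems are illustrative assumptions, and this is not NIST's implementation.

```python
# Sketch of the iterative Friedman grouping described above (illustrative,
# not NIST's implementation). Requires: pip install scipy
from scipy.stats import friedmanchisquare

def group_equivalent_systems(scores, alpha=0.05):
    """scores: dict mapping site ID -> list of per-document scores,
    with every list aligned on the same documents.
    Returns groups of site IDs deemed statistically equivalent."""
    # Order systems from highest- to lowest-scoring overall.
    remaining = sorted(scores, key=lambda s: sum(scores[s]), reverse=True)
    groups = []
    while remaining:
        pool = list(remaining)
        # While the Friedman test rejects the null hypothesis that the pooled
        # systems are the same, drop the lowest-scoring system and retest.
        # (The test needs at least three samples, hence the length check.)
        while len(pool) >= 3:
            _, p_value = friedmanchisquare(*(scores[s] for s in pool))
            if p_value >= alpha:
                break
            pool.pop()  # remove the current lowest-scoring system
        groups.append(pool)                 # one statistically equivalent group
        remaining = remaining[len(pool):]   # repeat for the removed systems
    return groups
```

Each returned group corresponds to one of the alternating color bands in the original report. A complete implementation would also need a pairwise fallback test once fewer than three systems remain in a pool, since scipy's Friedman test requires at least three samples.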

Key:

Note: Site 'nlmp' was unable to process the entire test set. No result is listed for that site.

Arabic-to-English Results

Large Data Track

NIST Subset

Overall BLEU Scores

Site ID  BLEU-4
google  0.4281
ibm  0.3954
isi  0.3908
rwth  0.3906
apptek*#  0.3874
lw  0.3741
bbn  0.3690
ntt  0.3680
itcirst  0.3466
cmu-uka  0.3369
umd-jhu  0.3333
edinburgh*#  0.3303
sakhr  0.3296
nict  0.2930
qmul  0.2896
lcc  0.2778
upc  0.2741
columbia  0.2465
ucb  0.1978
auc  0.1531
dcu  0.0947
kcsl*#  0.0522

Newswire BLEU Scores

Site ID  BLEU-4
google  0.4814
ibm  0.4542
rwth  0.4441
isi  0.4426
lw  0.4368
bbn  0.4254
apptek*#  0.4212
ntt  0.4035
umd-jhu  0.3997
edinburgh*#  0.3945
cmu-uka  0.3943
itcirst  0.3798
qmul  0.3737
sakhr  0.3736
nict  0.3568
lcc  0.3089
upc  0.3049
columbia  0.2759
ucb  0.2369
auc  0.1750
dcu  0.0875
kcsl*#  0.0423

Newsgroup BLEU Scores

Site ID  BLEU-4
apptek*#  0.3311
google  0.3225
ntt  0.2973
isi  0.2895
ibm  0.2774
bbn  0.2771
rwth  0.2726
itcirst  0.2696
sakhr  0.2634
lw  0.2503
cmu  0.2436
edinburgh*#  0.2208
lcc  0.2135
columbia  0.2111
umd-jhu  0.2059
nict  0.1875
upc  0.1842
ucb  0.1690
dcu  0.1177
qmul  0.1116
auc  0.1099
kcsl*#  0.0770

Broadcast News BLEU Scores

Site ID  BLEU-4
google  0.3781
apptek*#  0.3729
lw  0.3646
isi  0.3630
ibm  0.3612
rwth  0.3511
ntt  0.3324
bbn  0.3302
umd-jhu  0.3148
itcirst  0.3128
edinburgh*#  0.2925
cmu  0.2874
sakhr  0.2814
qmul  0.2768
upc  0.2463
nict  0.2458
lcc  0.2445
columbia  0.2054
auc  0.1419
ucb  0.1114
dcu  0.0594
kcsl*#  0.0326

GALE Subset

Overall BLEU Scores

Site ID  BLEU-4
apptek*#  0.1918
google  0.1826
isi  0.1714
ibm  0.1674
sakhr  0.1648
rwth  0.1639
lw  0.1594
ntt  0.1533
itcirst  0.1475
bbn  0.1461
cmu  0.1392
umd-jhu  0.1370
qmul  0.1345
edinburgh*#  0.1305
nict  0.1192
upc  0.1149
lcc  0.1129
columbia  0.0960
ucb  0.0732
auc  0.0635
dcu  0.0320
kcsl*#  0.0176

Newswire BLEU Scores

Site ID  BLEU-4
google  0.2647
ibm  0.2432
isi  0.2300
rwth  0.2263
apptek*#  0.2225
sakhr  0.2196
lw  0.2193
ntt  0.2180
bbn  0.2170
itcirst  0.2104
umd-jhu  0.2084
cmu  0.2055
edinburgh*#  0.2052
qmul  0.1984
nict  0.1773
lcc  0.1648
upc  0.1575
columbia  0.1438
ucb  0.1299
auc  0.0937
dcu  0.0466
kcsl*#  0.0182

Newsgroup BLEU Scores

Site ID  BLEU-4
apptek*#  0.1747
sakhr  0.1331
google  0.1130
ibm  0.1060
rwth  0.1017
isi  0.0918
ntt  0.0906
lw  0.0853
cmu  0.0840
bbn  0.0837
itcirst  0.0821
qmul  0.0818
umd-jhu  0.0754
edinburgh*#  0.0681
lcc  0.0643
nict  0.0639
columbia  0.0634
upc  0.0603
ucb  0.0411
auc  0.0326
dcu  0.0254
kcsl*#  0.0089

Broadcast News BLEU Scores

Site ID  BLEU-4
apptek*#  0.1944
isi  0.1766
google  0.1721
lw  0.1649
rwth  0.1599
ibm  0.1588
sakhr  0.1495
itcirst  0.1471
ntt  0.1469
bbn  0.1391
cmu  0.1362
umd-jhu  0.1309
qmul  0.1266
edinburgh*#  0.1240
nict  0.1152
upc  0.1150
lcc  0.1016
columbia  0.0879
auc  0.0619
ucb  0.0412
dcu  0.0252
kcsl*#  0.0229

Broadcast Conversation BLEU Scores

Site ID  BLEU-4
isi  0.1756
apptek*#  0.1747
google  0.1745
rwth  0.1615
lw  0.1582
ibm  0.1563
ntt  0.1512
sakhr  0.1446
itcirst  0.1425
bbn  0.1400
umd-jhu  0.1277
qmul  0.1265
cmu  0.1261
edinburgh*#  0.1203
upc  0.1200
lcc  0.1157
nict  0.1156
columbia  0.0866
ucb  0.0783
auc  0.0620
dcu  0.0306
kcsl*#  0.0183

Unlimited Data Track

NIST Subset

Overall BLEU Scores

Site ID  BLEU-4
google  0.4535
lw  0.4008
rwth  0.3970
rwth+sri+nrc+uw*  0.3966
nrc  0.3750
sri  0.3743
edinburgh*#  0.3449
cmu  0.3376
arl-cmu  0.1424

Newswire BLEU Scores

Site ID  BLEU-4
google  0.5034
lw  0.4589
rwth+sri+nrc+uw*  0.4493
rwth  0.4458
nrc  0.4300
sri  0.4240
edinburgh*#  0.4133
cmu  0.3974
arl-cmu  0.1402

Newsgroup BLEU Scores

Site ID  BLEU-4
google  0.3652
lw  0.2851
rwth  0.2829
nrc  0.2799
rwth+sri+nrc+uw*  0.2755
sri  0.2534
cmu  0.2372
edinburgh*#  0.2287
arl-cmu  0.1485

Broadcast News BLEU Scores

Site ID  BLEU-4
google  0.4018
lw  0.3685
rwth  0.3662
rwth+sri+nrc+uw*  0.3639
sri  0.3326
nrc  0.3312
edinburgh*#  0.3049
cmu  0.2988
arl-cmu  0.1363

GALE Subset

Overall BLEU Scores

Site ID  BLEU-4
google  0.1957
lw  0.1721
rwth+sri+nrc+uw*  0.1710
rwth  0.1680
sri  0.1614
nrc  0.1517
cmu  0.1382
edinburgh*#  0.1365
arl-cmu  0.0736

Newswire BLEU Scores

Site ID  BLEU-4
google  0.2812
lw  0.2294
rwth+sri+nrc+uw*  0.2289
rwth  0.2258
nrc  0.2172
sri  0.2081
edinburgh*#  0.2068
cmu  0.2006
arl-cmu  0.0858

Newsgroup BLEU Scores

Site ID  BLEU-4
google  0.1267
rwth  0.1133
rwth+sri+nrc+uw*  0.1078
lw  0.1007
nrc  0.1007
sri  0.0953
cmu  0.0894
edinburgh*#  0.0722
arl-cmu  0.0558

Broadcast News BLEU Scores

Site ID  BLEU-4
google  0.1868
rwth+sri+nrc+uw*  0.1730
lw  0.1715
sri  0.1661
rwth  0.1625
nrc  0.1415
edinburgh*#  0.1293
cmu  0.1276
arl-cmu  0.0855

Broadcast Conversation BLEU Scores

Site ID  BLEU-4
google  0.1824
lw  0.1756
rwth+sri+nrc+uw*  0.1676
sri  0.1671
rwth  0.1658
nrc  0.1429
edinburgh*#  0.1341
cmu  0.1322
arl-cmu  0.0584

Chinese-to-English Results

Large Data Track

NIST Subset

Overall BLEU Scores

Site ID  BLEU-4
isi  0.3393
google  0.3316
lw  0.3278
rwth  0.3022
ict  0.2913
edinburgh*#  0.2830
bbn  0.2781
nrc  0.2762
itcirst  0.2749
umd-jhu  0.2704
ntt  0.2595
nict  0.2449
cmu  0.2348
msr  0.2314
qmul  0.2276
hkust  0.2080
upc  0.2071
upenn  0.1958
iscas  0.1816
lcc  0.1814
xmu  0.1580
lingua*  0.1341
kcsl*#  0.0512
ksu  0.0401

Newswire BLEU Scores

Site ID  BLEU-4
isi  0.3486
google  0.3470
lw  0.3404
ict  0.3085
rwth  0.3022
nrc  0.2867
umd-jhu  0.2863
edinburgh*#  0.2776
bbn  0.2774
itcirst  0.2739
ntt  0.2656
nict  0.2509
cmu  0.2496
msr  0.2387
qmul  0.2299
upenn  0.2064
upc  0.2057
hkust  0.1999
lcc  0.1721
iscas  0.1715
xmu  0.1619
lingua*  0.1412
kcsl*#  0.0510
ksu  0.0380

Newsgroup BLEU Scores

Site ID  BLEU-4
google  0.2620
isi  0.2571
lw  0.2454
edinburgh*#  0.2434
rwth  0.2417
nrc  0.2330
ict  0.2325
bbn  0.2275
itcirst  0.2264
umd-jhu  0.2061
ntt  0.2036
nict  0.2006
msr  0.1878
cmu  0.1865
hkust  0.1851
qmul  0.1840
iscas  0.1681
upenn  0.1665
lcc  0.1634
upc  0.1619
xmu  0.1406
lingua*  0.1207
kcsl*#  0.0531
ksu  0.0361

Broadcast News BLEU Scores

Site ID  BLEU-4
rwth  0.3501
google  0.3481
isi  0.3463
lw  0.3327
bbn  0.3197
edinburgh*#  0.3172
itcirst  0.3128
ict  0.2977
ntt  0.2928
umd-jhu  0.2928
nrc  0.2914
qmul  0.2571
nict  0.2568
msr  0.2527
cmu  0.2468
upc  0.2403
hkust  0.2376
iscas  0.2090
lcc  0.2046
upenn  0.2008
xmu  0.1652
lingua*  0.1323
kcsl*#  0.0475
ksu  0.0464

GALE Subset

Overall BLEU Scores

Site ID  BLEU-4
google  0.1470
isi  0.1413
lw  0.1299
edinburgh*#  0.1199
itcirst  0.1194
nrc  0.1194
rwth  0.1187
ict  0.1185
bbn  0.1165
umd-jhu  0.1140
cmu  0.1135
ntt  0.1116
nict  0.1106
hkust  0.0984
msr  0.0972
qmul  0.0943
upc  0.0931
upenn  0.0923
iscas  0.0860
lcc  0.0813
xmu  0.0747
lingua*  0.0663
ksu  0.0218
kcsl*#  0.0199

Newswire BLEU Scores

Site ID  BLEU-4
google  0.1905
isi  0.1685
lw  0.1596
ict  0.1515
edinburgh*#  0.1467
rwth  0.1448
bbn  0.1433
umd-jhu  0.1419
nrc  0.1404
itcirst  0.1377
cmu  0.1353
ntt  0.1350
msr  0.1280
hkust  0.1161
nict  0.1155
qmul  0.1102
upenn  0.1068
upc  0.1039
iscas  0.0947
lcc  0.0878
xmu  0.0861
lingua*  0.0657
kcsl*#  0.0178
ksu  0.0138

Newsgroup BLEU Scores

Site ID  BLEU-4
google  0.1365
isi  0.1235
edinburgh*#  0.1140
lw  0.1137
ict  0.1130
itcirst  0.1108
nrc  0.1098
nict  0.1075
rwth  0.1071
cmu  0.1054
bbn  0.1049
ntt  0.1026
umd-jhu  0.0978
upenn  0.0941
hkust  0.0892
qmul  0.0858
upc  0.0851
msr  0.0841
lcc  0.0765
iscas  0.0745
lingua*  0.0687
xmu  0.0681
ksu  0.0249
kcsl*#  0.0177

Broadcast News BLEU Scores

Site ID  BLEU-4
isi  0.1441
google  0.1409
lw  0.1343
rwth  0.1231
itcirst  0.1193
nrc  0.1192
cmu  0.1159
bbn  0.1146
ict  0.1146
edinburgh*#  0.1110
ntt  0.1096
nict  0.1090
umd-jhu  0.1084
hkust  0.1005
upc  0.0986
qmul  0.0951
msr  0.0922
iscas  0.0891
upenn  0.0882
lcc  0.0814
xmu  0.0705
lingua*  0.0609
kcsl*#  0.0204
ksu  0.0192

Broadcast Conversation BLEU Scores

Site ID  BLEU-4
isi  0.1280
google  0.1262
edinburgh*#  0.1119
lw  0.1112
itcirst  0.1106
nict  0.1106
umd-jhu  0.1102
nrc  0.1095
bbn  0.1060
ntt  0.1016
rwth  0.1013
ict  0.0990
cmu  0.0973
hkust  0.0891
msr  0.0873
qmul  0.0870
upc  0.0848
iscas  0.0842
upenn  0.0815
lcc  0.0796
xmu  0.0753
lingua*  0.0700
ksu  0.0270
kcsl*#  0.0223

Unlimited Data Track

NIST Subset

Overall BLEU Scores

Site ID  BLEU-4
google  0.3496
rwth  0.2975
edinburgh*#  0.2843
cmu  0.2449
casia  0.1894
xmu  0.1713

Newswire BLEU Scores

Site ID  BLEU-4
google  0.3634
rwth  0.2974
edinburgh*#  0.2852
cmu  0.2430
casia  0.1905
xmu  0.1696

Newsgroup BLEU Scores

Site ID  BLEU-4
google  0.2870
edinburgh*#  0.2450
rwth  0.2307
cmu  0.2004
casia  0.1709
xmu  0.1618

Broadcast News BLEU Scores

Site ID  BLEU-4
google  0.3649
rwth  0.3509
edinburgh*#  0.3142
cmu  0.2644
casia  0.1889
xmu  0.1818

GALE Subset

Overall BLEU Scores

Site ID  BLEU-4
google  0.1526
edinburgh*#  0.1187
rwth  0.1172
cmu  0.1034
casia  0.0900
xmu  0.0793

Newswire BLEU Scores

Site ID  BLEU-4
google  0.2057
edinburgh*#  0.1465
rwth  0.1436
cmu  0.1158
casia  0.1001
xmu  0.0817

Newsgroup BLEU Scores

Site ID  BLEU-4
google  0.1432
edinburgh*#  0.1070
rwth  0.1032
cmu  0.1015
casia  0.0916
xmu  0.0782

Broadcast News BLEU Scores

Site ID  BLEU-4
google  0.1482
rwth  0.1224
edinburgh*#  0.1090
cmu  0.1020
casia  0.0891
xmu  0.0775

Broadcast Conversation BLEU Scores

Site ID  BLEU-4
google  0.1206
edinburgh*#  0.1157
rwth  0.1010
cmu  0.0957
casia  0.0812
xmu  0.0801


Unlimited Plus Data Track

NIST Subset (BLEU-4 scores)

Site ID  Language  Overall  Newswire  Newsgroup  Broadcast News
google  Arabic  0.4569  0.5060  0.3727  0.4076
google  Chinese  0.3615  0.3725  0.2926  0.3859

GALE Subset (BLEU-4 scores)

Site ID  Language  Overall  Newswire  Newsgroup  Broadcast News  Broadcast Conversation
google  Arabic  0.2024  0.2820  0.1359  0.1932  0.1925
google  Chinese  0.1576  0.2086  0.1454  0.1532  0.1300


Release History