The 2009 NIST Language Recognition Evaluation Results

Date of Release: Tuesday, August 11, 2009

The 2009 NIST Language Recognition Evaluation (LRE09) was the fifth in the evaluation series that began in 1996 to evaluate human language recognition technology. NIST conducts these evaluations to support language recognition (LR) research and help advance the state of the art in language recognition. These evaluations make an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official LRE09 evaluation plan. More information and a summary of results are available in a brief presentation.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial LR products were generally from research systems, not commercially available products. Since LRE09 was an evaluation of research algorithms, the LRE09 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

The data, protocols, and metrics employed in this evaluation were chosen to support LR research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.

Evaluation Task

The evaluation task was posed as a detection task because it was the simplest and most general form of the discrimination task and because the results could be used to estimate system performance on other, more complex language recognition tasks. Given a segment of speech and a language of interest to be detected, the task was to decide whether the target language was spoken in the speech segment by outputting (1) a hard decision (a yes/no answer) and (2) a likelihood score (the higher the score, the greater the belief that the answer is yes).

Evaluation Tests

The performance of a detection system is strongly affected by how similar the languages to be detected are to one another. Eight language pairs were selected to probe this effect. Table 1 lists the target languages and Table 2 lists the language pairs.

Table 1: The 23 target languages.
Target Languages
Amharic
Bosnian
Cantonese
Creole (Haitian)
Croatian
Dari
English (American)
English (Indian)
Farsi
French
Georgian
Hausa
Hindi
Korean
Mandarin
Pashto
Portuguese
Russian
Spanish
Turkish
Ukrainian
Urdu
Vietnamese

Table 2: The 8 expressed pairs of interest.
Pair of Interest
Bosnian-Croatian
Cantonese-Mandarin
Creole (Haitian)-French
Dari-Farsi
English (American)-English (Indian)
Hindi-Urdu
Portuguese-Spanish
Russian-Ukrainian

 

Evaluation Conditions

Three evaluation conditions were included. They differed in the alternative hypothesis posed with each trial: "all other target languages" for the closed-set condition, "all other languages" for the open-set condition, and "one other (given) language" for the language-pair condition. Each condition used segments of three durations: 30 seconds, 10 seconds, and 3 seconds. Performance was evaluated separately for each condition and segment duration. Participants could choose to optimize for any or all of these conditions.
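The three alternative hypotheses can be illustrated with a short sketch. The function name, the condition labels used as strings, and the OUT_OF_SET placeholder are hypothetical and chosen only for illustration; they are not part of the evaluation plan or of any NIST scoring software.

```python
# Hypothetical placeholder for any language outside the target list.
OUT_OF_SET = "<out-of-set>"


def alternative_hypothesis(condition, target, target_languages, other=None):
    """Return the languages that form the alternative hypothesis for one trial.

    condition        -- "closed-set", "open-set", or "language-pair"
    target           -- the language to be detected
    target_languages -- all target languages of the evaluation
    other            -- the given second language (language-pair only)
    """
    if target not in target_languages:
        raise ValueError("target must be one of the target languages")
    others = [lang for lang in target_languages if lang != target]
    if condition == "closed-set":
        # All other target languages.
        return others
    if condition == "open-set":
        # All other languages, including ones outside the target list.
        return others + [OUT_OF_SET]
    if condition == "language-pair":
        # Exactly one other, given language.
        if other is None or other == target:
            raise ValueError("language-pair trials need a distinct second language")
        return [other]
    raise ValueError("unknown condition: " + condition)


targets = ["Dari", "Farsi", "Hindi", "Urdu"]
print(alternative_hypothesis("closed-set", "Hindi", targets))
print(alternative_hypothesis("language-pair", "Hindi", targets, other="Urdu"))
```

In this sketch the only difference between the closed-set and open-set conditions is whether speech from outside the target list may appear as a nontarget.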

Training Data

Participants in the evaluation were given conversational telephone speech data from past LREs to train their systems and, for languages not present in a previous LRE, a limited amount of human-annotated Voice of America training data and a large quantity of automatically annotated or unannotated Voice of America training data. Figure 1 shows the distribution across languages of the training data that was new for LRE09. Some languages have more data than others because of data availability. Participants could also augment this data with additional data, provided that the extra data was publicly available and its source was documented in their system descriptions.

Figure 1: Training data distribution. Number of 30-second human-annotated segments. [Bar graph showing the distribution of the training data set.]

Evaluation Data

The evaluation data set contained segments drawn from Voice of America broadcasts. For some languages, unused conversational telephone speech data left over from the previously collected data pool was also used in the evaluation. The evaluation data set also contained data from sixteen unknown (undisclosed) nontarget languages. Table 3 shows the distribution of the evaluation data across the different languages.

 

Test Segments

For each conversation side, two subsets of segments were extracted. Each subset contained three segments, one for each duration (3 seconds, 10 seconds, and 30 seconds), nested so that the 3-second segment was a subset of the 10-second segment and the 10-second segment was a subset of the 30-second segment.

Table 3: The number of 30-second VOA train segments and VOA/CTS test segments by language.

Lang.          | VOA Train  | VOA Test  | CTS Test
Amharic        | 171        | 398       | - - - - -
Bosnian        | 194        | 355       | - - - - -
Cantonese      | - - - - -  | 62        | 316
Creole-Haitian | 186        | 323       | - - - - -
Croatian       | 181        | 376       | - - - - -
Dari           | 194        | 389       | - - - - -
English-Am.    | - - - - -  | 374       | 522
English-Ind.   | - - - - -  | - - - - - | 574
Farsi          | - - - - -  | 338       | 52
French         | 196        | 395       | - - - - -
Georgian       | 142        | 399       | - - - - -
Hausa          | 200        | 389       | - - - - -
Hindi          | - - - - -  | 397       | 270
Korean         | - - - - -  | 318       | 145
Mandarin       | - - - - -  | 390       | 625
Pashto         | 197        | 395       | - - - - -
Portuguese     | 166        | 397       | - - - - -
Russian        | - - - - -  | 254       | 257
Spanish        | - - - - -  | 385       | - - - - -
Turkish        | 194        | 394       | - - - - -
Ukrainian      | 194        | 388       | - - - - -
Urdu           | - - - - -  | 347       | 32
Vietnamese     | - - - - -  | 27        | 288
Arabic         | Out-of-set | 187       | - - - - -
Azerbaijani    | Out-of-set | 366       | - - - - -
Belorussian    | Out-of-set | 363       | - - - - -
Bengali        | Out-of-set | - - - - - | 43
Bulgarian      | Out-of-set | 375       | - - - - -
Italian        | Out-of-set | - - - - - | 30
Japanese       | Out-of-set | - - - - - | 180
Punjabi        | Out-of-set | - - - - - | 9
Romanian       | Out-of-set | 400       | - - - - -
Shanghai-Wu    | Out-of-set | - - - - - | 69
Southern-min   | Out-of-set | - - - - - | 48
Swahili        | Out-of-set | 396       | - - - - -
Tagalog        | Out-of-set | - - - - - | 84
Thai           | Out-of-set | - - - - - | 188
Tibetan        | Out-of-set | 368       | - - - - -
Uzbek          | Out-of-set | 382       | - - - - -

 

Evaluation Rules

Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.

Performance Measurement

The performance of a detection system is characterized by its false alarm and miss error rates. We defined a detection cost function, Cavg, that is an equally weighted sum of the two error types, averaged over languages.

Figure 2: Evaluation metric. [Image: the LRE09 metric.]
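As a rough illustration of how a language-averaged detection cost of this kind can be computed, the sketch below covers the closed-set condition only. It assumes the form commonly quoted from the LRE evaluation plans, in which each target language contributes C_miss * P_target * P_miss plus the false alarm cost against every other target language, with C_miss = C_fa = 1 and P_target = 0.5. Those constants, the function name cavg_closed_set, and the example error rates are assumptions made for illustration and are not taken from this page; the official LRE09 evaluation plan remains the authoritative definition.

```python
def cavg_closed_set(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Average detection cost over target languages (closed-set sketch).

    p_miss[t]  -- miss rate when t is the target language
    p_fa[t][n] -- false alarm rate for target t on segments of language n
    The cost constants and the target prior are assumed defaults.
    """
    languages = sorted(p_miss)
    n_lang = len(languages)
    if n_lang < 2:
        raise ValueError("need at least two target languages")
    # The nontarget prior is spread evenly over the other N - 1 languages.
    p_nontarget = (1.0 - p_target) / (n_lang - 1)
    total = 0.0
    for target in languages:
        miss_cost = c_miss * p_target * p_miss[target]
        fa_cost = sum(
            c_fa * p_nontarget * p_fa[target][nontarget]
            for nontarget in languages
            if nontarget != target
        )
        total += miss_cost + fa_cost
    return total / n_lang


# Made-up error rates for three languages, for illustration only.
langs = ["A", "B", "C"]
miss = {t: 0.10 for t in langs}
fa = {t: {n: 0.02 for n in langs if n != t} for t in langs}
print(cavg_closed_set(miss, fa))  # approximately 0.06
```

With equal costs and a target prior of 0.5, the result is the mean of the miss rate and the average false alarm rate, which is why the made-up rates above give a value of about 0.06.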

Results Representation

In addition to the bar charts used to plot the Cavg values obtained, Detection Error Tradeoff (DET) curves, a linearized version of the ROC curve, were used to show all operating points as the likelihood threshold was varied. Two special operating points, (a) the system decision point and (b) the optimal decision point, were plotted on each curve. More information on the DET curve can be found in a paper by Martin, A. F. et al., "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
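The following sketch, offered for illustration only, shows how the operating points of a DET curve can be traced from a set of trial scores. It assumes that a trial is accepted when its score is at or above the threshold, and that the "optimal decision point" is the threshold minimizing an equally weighted sum of the miss and false alarm rates; both are assumptions, as are the function names and the example scores. The normal-deviate (probit) transform that gives DET plots their linearized axes is computed with the standard library's statistics.NormalDist.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (threshold, p_miss, p_fa) for every candidate threshold.

    A trial is accepted when its score is >= threshold (assumed convention).
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for th in thresholds:
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        points.append((th, p_miss, p_fa))
    return points


def min_cost_point(points, p_target=0.5):
    """Pick the point with the lowest prior-weighted error (assumed criterion)."""
    return min(points, key=lambda p: p_target * p[1] + (1.0 - p_target) * p[2])


def probit(p, eps=1e-6):
    """Normal-deviate transform for DET axes, clamped to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


# Made-up scores, for illustration only.
target_scores = [2.0, 1.5, 1.0]
nontarget_scores = [0.5, -1.0, -2.0]
points = det_points(target_scores, nontarget_scores)
best = min_cost_point(points)
print(best)
# Axis coordinates of each operating point on a DET plot.
print([(probit(p_fa), probit(p_miss)) for _, p_miss, p_fa in points])
```

Plotting the transformed false alarm rate against the transformed miss rate yields the DET curve; the system decision point would be the single (miss, false alarm) pair produced by the system's own hard decisions.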

Participating Organizations

A diverse group of organizations from four continents participated in the evaluation. Table 4 lists some of these organizations in alphabetical order.

Table 4: Participating organizations.

Organization | Location
Universidad Autonoma de Madrid | Madrid, Spain
Brno University of Technology | Brno, Czech Republic
Agnitio | Somerset West, South Africa
Institute of Automation, Chinese Academy of Sciences | Beijing, China
Chinese University of Hong Kong | N.T., Hong Kong
University of the Basque Country | Bizkaia, Spain
iFlyTek Speech Lab, EEIS University of Science and Technology of China | HeFei, AnHui, China
Institute for Infocomm Research | Singapore
Institute of Acoustics, Chinese Academy of Sciences | Beijing, China
L2F-Spoken Language Systems Lab INESC-ID Lisboa | Lisbon, Portugal
Laboratoire Informatique d'Avignon | Avignon, France
CNRS-LIMSI (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur) | Orsay, France
Loquendo | Torino, Italy
Politecnico di Torino | Torino, Italy
MIT Lincoln Laboratory | Lexington, MA, USA
National Taipei University of Technology, Department of Electrical Engineering & Graduate Institute of Computer and Communication Engineering | Taipei, Taiwan
Tsinghua University Department of Electrical Engineering | Beijing, China
Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek | Soesterberg, The Netherlands

 

Evaluation Results

The results are presented without attribution to the participating organizations in accordance with a policy established for NIST Language Recognition Evaluations.

The graphs below show the results for each test and condition. The bar charts show the Cavg values in increasing order, while the DET curves show the tradeoff between the two error types as the decision threshold is varied. Note that the circle on each DET curve indicates the system decision point, while the triangle indicates the optimal decision point.

 

[Figures: closed-set results plots for the 30-second, 10-second, and 3-second segments, plus a bar chart of closed-set results.]

[Figures: open-set results plots for the 30-second, 10-second, and 3-second segments, plus a bar chart of open-set results.]

[Figure: bar chart of language-pair results.]

[Figures: results plots for the 30-second, 10-second, and 3-second segments for each language pair: Bosnian/Croatian, Cantonese/Mandarin, Creole/French, Dari/Farsi, American English/Indian English, Hindi/Urdu, Portuguese/Spanish, and Russian/Ukrainian.]

 

Evaluation History

Figure 3 summarizes the evaluation series in terms of the number of languages, dialects, and participants since the first evaluation in 1996.

Figure 3: LRE History 1996-2009. [Line graph showing how LRE has grown from 1996 to 2009 in terms of the number of target languages, out-of-set languages, and participants.]

Figure 4 summarizes the best Cavg obtained in each evaluation for each of the three durations. Note the decrease in performance for the 30-second segments in the 2005 evaluation, when Indian English was introduced.

Figure 4: LRE Performance History 1996-2009. [Line graph showing best system performance from 1996 to 2009.]

Future Plans

The next language recognition evaluation is planned for Spring/Summer 2011.