Date of Release: Tuesday, August 11, 2009
The 2009 NIST Language Recognition Evaluation (LRE09) was the fifth in the evaluation series that began in 1996 to evaluate human language recognition technology. NIST conducts these evaluations to support language recognition (LR) research and to help advance the state of the art in language recognition. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official LRE09 evaluation plan. More information and a summary of results are available in a brief presentation.
Disclaimer
These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial LR products were generally from research systems, not commercially available products. Since LRE09 was an evaluation of research algorithms, the LRE09 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.
The data, protocols, and metrics employed in this evaluation were chosen to support LR research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.
For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.
The evaluation task was posed as a detection task because it was the simplest and most general form of the discrimination task and because the results could be used to estimate the system performance on other more complex language recognition tasks. Given a segment of speech and a language of interest to be detected, the task was to decide whether the target language was spoken in the speech segment by outputting (1) a hard decision (a yes/no answer) and (2) a likelihood score (the higher the score, the greater the belief that the answer is yes).
Performance of a detection system is strongly affected by how similar the languages to be detected are. Eight language pairs were selected to probe this. Table 1 lists the target languages and Table 2 lists the language pairs.
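The two outputs required per trial can be pictured as a small record. The sketch below is only an illustration: the `TrialOutput` class, the `make_output` helper, and the idea of deriving the hard decision by thresholding the score at 0.0 are assumptions made for the example, not the submission format defined in the evaluation plan.

```python
from dataclasses import dataclass


@dataclass
class TrialOutput:
    """One system response for a (segment, target language) trial."""
    segment_id: str
    target_language: str
    decision: bool   # hard decision: True means "yes, the target language is spoken"
    score: float     # likelihood score: higher means stronger belief in "yes"


def make_output(segment_id, target_language, score, threshold=0.0):
    # Assumption: the hard decision is obtained by comparing the score
    # with a fixed threshold chosen by the system developer.
    return TrialOutput(segment_id, target_language, score >= threshold, score)


if __name__ == "__main__":
    print(make_output("seg0001", "Hindi", 1.2))
    print(make_output("seg0001", "Urdu", -0.3))
```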
Table 1: The 23 target languages.
| Target Languages |
|---|
| Amharic |
| Bosnian |
| Cantonese |
| Creole (Haitian) |
| Croatian |
| Dari |
| English (American) |
| English (Indian) |
| Farsi |
| French |
| Georgian |
| Hausa |
| Hindi |
| Korean |
| Mandarin |
| Pashto |
| Portuguese |
| Russian |
| Spanish |
| Turkish |
| Ukrainian |
| Urdu |
| Vietnamese |
Table 2: The 8 expressed pairs of interest.
| Pair of Interest |
|---|
| Bosnian-Croatian |
| Cantonese-Mandarin |
| Creole (Haitian)-French |
| Dari-Farsi |
| English (American)-English (Indian) |
| Hindi-Urdu |
| Portuguese-Spanish |
| Russian-Ukrainian |
Three evaluation conditions were included. They differed in the alternative hypothesis posed with each trial: "all other target languages" for the closed-set condition, "all other languages" for the open-set condition, and "one other (given) language" for the language-pair condition. Each condition was evaluated at three segment durations: 30 seconds, 10 seconds, and 3 seconds. Performance was evaluated separately for each condition and segment duration. Participants could choose to optimize for all or any of these conditions.
Participants in the evaluation were given conversational telephone speech data from past LREs to train their systems and, for languages not present in a previous LRE, a limited amount of human-annotated Voice of America training data and a large quantity of automatically annotated or unannotated Voice of America training data. Figure 1 shows the distribution of the training data new for LRE09 across the different languages. Some languages have more data than others due to data availability. Participants could also augment this data with additional data. However, the extra data had to be publicly available, and its source had to be documented in the system description.
Figure 1: Training data distribution. Number of 30-second human-annotated segments.
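The difference between the three conditions can be made concrete by listing which languages form the alternative hypothesis for a given target. The following is a minimal sketch; the function name, the condition strings, and the representation of the open set as an explicit list of out-of-set languages are assumptions for illustration (in the evaluation itself the out-of-set languages were not disclosed to participants).

```python
def alternative_languages(condition, target, targets, out_of_set=(), pair_partner=None):
    """Return the languages that make up the alternative hypothesis for one trial."""
    others = [lang for lang in targets if lang != target]
    if condition == "closed-set":
        # Alternative: all other target languages.
        return others
    if condition == "open-set":
        # Alternative: all other languages, target or not.
        return others + list(out_of_set)
    if condition == "language-pair":
        # Alternative: one other, given language.
        if pair_partner is None:
            raise ValueError("language-pair trials need a pair_partner")
        return [pair_partner]
    raise ValueError(f"unknown condition: {condition}")


if __name__ == "__main__":
    targets = ["Hindi", "Urdu", "Farsi"]
    print(alternative_languages("closed-set", "Hindi", targets))
    print(alternative_languages("open-set", "Hindi", targets, out_of_set=["Thai"]))
    print(alternative_languages("language-pair", "Hindi", targets, pair_partner="Urdu"))
```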
The evaluation data set contained segments drawn from Voice of America broadcasts. For some languages, unused conversational telephone speech data left over from the previously collected data pool was also used in the evaluation. The evaluation data set also contained data from sixteen unknown (undisclosed) nontarget languages. Table 3 shows the distribution of the evaluation data across the different languages.
For each conversation side, two subsets of segments were extracted. Each subset contained 3 segments for the three durations 3-second, 10-second, and 30-second with the 3-second segment a subset of the 10-second segment and the 10-second segment a subset of the 30-second segment.
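The nesting of the three durations can be sketched as follows. How the shorter segments were actually positioned inside the longer ones is not stated here, so centring each segment inside the previous one is purely an assumption for illustration, as are the function names.

```python
def nested_segments(start, durations=(30.0, 10.0, 3.0)):
    """Return (start, end) times in seconds, each segment inside the previous one.

    Assumption: every shorter segment is centred within the longer one.
    """
    segments = []
    lo, hi = start, start + durations[0]
    for duration in durations:
        centre = (lo + hi) / 2.0
        lo, hi = centre - duration / 2.0, centre + duration / 2.0
        segments.append((lo, hi))
    return segments


def is_nested(inner, outer):
    """True if the inner (start, end) interval lies within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


if __name__ == "__main__":
    seg30, seg10, seg3 = nested_segments(0.0)
    print(seg30, seg10, seg3)
    print(is_nested(seg3, seg10), is_nested(seg10, seg30))
```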
Table 3: The number of 30-second VOA train segments and VOA/CTS test segments by language.
| Lang. | VOA Train | VOA Test | CTS Test |
|---|---|---|---|
| Amharic | 171 | 398 | - |
| Bosnian | 194 | 355 | - |
| Cantonese | - | 62 | 316 |
| Creole-Haitian | 186 | 323 | - |
| Croatian | 181 | 376 | - |
| Dari | 194 | 389 | - |
| English-Am. | - | 374 | 522 |
| English-Ind. | - | - | 574 |
| Farsi | - | 338 | 52 |
| French | 196 | 395 | - |
| Georgian | 142 | 399 | - |
| Hausa | 200 | 389 | - |
| Hindi | - | 397 | 270 |
| Korean | - | 318 | 145 |
| Mandarin | - | 390 | 625 |
| Pashto | 197 | 395 | - |
| Portuguese | 166 | 397 | - |
| Russian | - | 254 | 257 |
| Spanish | - | 385 | - |
| Turkish | 194 | 394 | - |
| Ukrainian | 194 | 388 | - |
| Urdu | - | 347 | 32 |
| Vietnamese | - | 27 | 288 |
| Arabic | Out-of-set | 187 | - |
| Azerbaijani | Out-of-set | 366 | - |
| Belorussian | Out-of-set | 363 | - |
| Bengali | Out-of-set | - | 43 |
| Bulgarian | Out-of-set | 375 | - |
| Italian | Out-of-set | - | 30 |
| Japanese | Out-of-set | - | 180 |
| Punjabi | Out-of-set | - | 9 |
| Romanian | Out-of-set | 400 | - |
| Shanghai-Wu | Out-of-set | - | 69 |
| Southern-min | Out-of-set | - | 48 |
| Swahili | Out-of-set | 396 | - |
| Tagalog | Out-of-set | - | 84 |
| Thai | Out-of-set | - | 188 |
| Tibetan | Out-of-set | 368 | - |
| Uzbek | Out-of-set | 382 | - |
Evaluation Rules
Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.
Performance Measurement
The performance of a detection system is characterized by its false alarm and miss error rates. We defined a detection cost function, Cavg, that combines the two error types with equal weight and averages the result over the languages.
Figure 2: Evaluation metric.
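A cost of this kind can be sketched in code. The version below is a minimal illustration of the closed-set case only, assuming miss and false alarm costs of 1, a target prior of 0.5, and the remaining prior spread evenly over the other target languages. These values, the function name `cavg_closed_set`, and the trial tuple layout are assumptions made for the example; the evaluation plan remains the authoritative definition.

```python
from collections import defaultdict


def cavg_closed_set(trials, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Average detection cost over target languages (closed-set sketch).

    trials: iterable of (target_language, true_language, decision), where
    decision is True when the system answered "yes". Needs >= 2 targets.
    """
    trials = list(trials)
    langs = sorted({target for target, _, _ in trials})
    if len(langs) < 2:
        raise ValueError("need at least two target languages")

    miss = defaultdict(lambda: [0, 0])  # target -> [misses, target trials]
    fa = defaultdict(lambda: [0, 0])    # (target, nontarget) -> [false alarms, trials]
    for target, truth, decision in trials:
        if truth == target:
            miss[target][1] += 1
            miss[target][0] += 0 if decision else 1
        else:
            fa[(target, truth)][1] += 1
            fa[(target, truth)][0] += 1 if decision else 0

    p_nontarget = (1.0 - p_target) / (len(langs) - 1)
    total = 0.0
    for target in langs:
        n_miss, n_tgt = miss[target]
        cost = c_miss * p_target * (n_miss / n_tgt if n_tgt else 0.0)
        for other in langs:
            if other == target:
                continue
            n_fa, n_non = fa[(target, other)]
            cost += c_fa * p_nontarget * (n_fa / n_non if n_non else 0.0)
        total += cost
    return total / len(langs)


if __name__ == "__main__":
    demo = [
        ("A", "A", True), ("A", "A", False), ("A", "B", False),
        ("B", "B", True), ("B", "A", True), ("B", "A", False),
    ]
    print(cavg_closed_set(demo))
```

In the demo, language A has a miss rate of 0.5 and no false alarms, while language B has no misses and a false alarm rate of 0.5, so each language contributes a cost of 0.25.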
Results Representation
In addition to bar charts, which were used to plot the Cavg obtained, Detection Error Tradeoff (DET) curves, a linearized version of the ROC curve, were used to show all operating points as the likelihood threshold was varied. Two special operating points, (a) the system decision point and (b) the optimal decision point, were plotted on the curve. More information on the DET curve can be found in a paper by Martin, A. F. et al., "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
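The construction of a DET curve can be sketched as a threshold sweep over the likelihood scores. The code below is a minimal illustration using only the Python standard library; the function names are hypothetical, and treating the optimal decision point as the threshold that minimizes the equally weighted sum of the two error rates is an assumption for the example.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (p_false_alarm, p_miss) for every candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for th in thresholds:
        # A trial is accepted ("yes") when its score is at or above the threshold.
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        points.append((p_fa, p_miss))
    return points


def probit(p, eps=1e-6):
    """Normal deviate of an error rate; DET axes use this scale.

    Rates are clipped away from 0 and 1, where the deviate is infinite.
    """
    return NormalDist().inv_cdf(min(max(p, eps), 1.0 - eps))


def optimal_point(points):
    """Operating point with the lowest equally weighted error sum (assumption)."""
    return min(points, key=lambda point: point[0] + point[1])


if __name__ == "__main__":
    pts = det_points([1.0, 2.0], [0.0, 1.5])
    for p_fa, p_miss in pts:
        print(p_fa, p_miss, probit(p_fa), probit(p_miss))
    print(optimal_point(pts))
```

Plotting `probit(p_fa)` against `probit(p_miss)` gives the linearized axes that distinguish a DET curve from an ordinary ROC plot.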
Participating Organizations
A diverse group of organizations from four continents participated in the evaluation. Table 4 lists some of these organizations in alphabetical order.
| Organization | Location |
|---|---|
| Universidad Autonoma de Madrid | Madrid, Spain |
| Brno University of Technology; Agnitio | Brno, Czech Republic; Somerset West, South Africa |
| Institute of Automation, Chinese Academy of Sciences | Beijing, China |
| Chinese University of Hong Kong | N.T., Hong Kong |
| University of the Basque Country | Bizkaia, Spain |
| iFlyTek Speech Lab, EEIS University of Science and Technology of China | HeFei, AnHui, China |
| Institute for Infocomm Research | Singapore |
| Institute of Acoustics, Chinese Academy of Sciences | Beijing, China |
| L2F-Spoken Language Systems Lab INESC-ID Lisboa | Lisbon, Portugal |
| Laboratoire Informatique d'Avignon | Avignon, France |
| CNRS-LIMSI (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur) | Orsay, France |
| Loquendo; Politecnico di Torino | Torino, Italy; Torino, Italy |
| MIT Lincoln Laboratory | Lexington, MA, USA |
| National Taipei University of Technology, Department of Electrical Engineering & Graduate Institute of Computer and Communication Engineering | Taipei, Taiwan |
| Tsinghua University, Department of Electrical Engineering | Beijing, China |
| Nederlandse Organisatie voor Toegepast Natuurwetenschappelijk Onderzoek | Soesterberg, The Netherlands |
Evaluation Results
The results are presented without attribution to the participating organizations in accordance with a policy established for NIST Language Recognition Evaluations.
The graphs below show the results for each test and condition. The bar charts show the Cavg in increasing order while the DET curves show the tradeoff when the two error types are varied. Note that the circle on the DET curve indicates the system decision point while the triangle indicates the optimal decision point.
Evaluation History
Figure 3 summarizes the evaluation series in terms of the number of languages, dialects, and participants since the first evaluation in 1996.
Figure 3: LRE History 1996-2009.
Figure 4 summarizes the best Cavg obtained in each evaluation for each of the three durations. Note a decrease in performance for the 30-second segments in the 2005 evaluation, when Indian English was introduced.
Figure 4: LRE Performance History 1996-2009.
Future Plans
The next language evaluation is planned for Spring/Summer 2011.