The 2007 NIST Language Recognition Evaluation Results

Date of Release: Thursday, May 1, 2008

The 2007 NIST Language Recognition Evaluation (LRE-07) was the fourth in the evaluation series that began in 1996 to evaluate human language recognition technology. NIST conducts these evaluations in order to support language recognition (LR) research and help advance the state-of-the-art in language recognition. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official LRE-07 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial LR products were generally from research systems, not commercially available products. Since LRE-07 was an evaluation of research algorithms, the LRE-07 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

The data, protocols, and metrics employed in this evaluation were chosen to support LR research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.

For the reasons above, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.

Evaluation Task

The evaluation task was posed as a detection task because it was the simplest and most general form of the discrimination task and because the results could be used to estimate system performance on other, more complex language recognition tasks. Given a segment of speech and a language (or dialect) of interest to be detected, the task was to decide whether the target language (or dialect) was spoken in the speech segment by outputting two answers: (1) a hard decision (a yes/no answer) and (2) a likelihood score (the higher the score, the greater the belief that the answer is yes).

Evaluation Tests

The performance of a detection system is strongly affected by how different or similar the languages (or dialects) to be detected are. Six tests were devised to probe the various language and dialect differences. Table 1 lists the six tests and the corresponding target languages/dialects.

Table 1: The six tests where target languages for each test are limited to those marked with an x. LR is short for Language Recognition and DR is short for Dialect Recognition.
Target Languages /   General  Chinese  English  Mandarin  Hindustani  Spanish
Dialects             LR       LR       DR       DR        DR          DR
Arabic               x
Bengali              x
Farsi                x
German               x
Japanese             x
Korean               x
Russian              x
Tamil                x
Thai                 x
Vietnamese           x
Chinese              x
   Cantonese                  x
   Mandarin                   x
      Mainland                                  x
      Taiwan                                    x
   Min                        x
   Wu                         x
English              x
   American                            x
   Indian                              x
Hindustani           x
   Hindi                                                  x
   Urdu                                                   x
Spanish              x
   Caribbean                                                          x
   non-Caribbean                                                      x

Evaluation Conditions

Two sets of evaluation conditions were included to measure how additional information given to the systems affects their performance. Participants could choose to optimize for either or both of these conditions.

Training Data

Participants in the evaluation were given a basic data set to train their systems. Figure 1 shows the distribution of the training data across the different languages and dialects in terms of the number of unique speakers. Some languages/dialects have more data than others because of data availability. Participants could also augment this data with additional data. However, the extra data had to be publicly available, and its source had to be documented in the system descriptions.

Figure 1: Training data distribution (bar graph).

Evaluation Data

The evaluation data set contained segments drawn from approximately 40 conversation sides for each target language and dialect. For some languages and dialects, unused data left over from the previously collected data pool was also included. The evaluation data set also contained data from five nontarget languages that were not disclosed to participants (French, Indonesian, Italian, Punjabi, and Tagalog). Figure 2 shows the distribution of the evaluation data across the different languages and dialects. Note that speakers may overlap across languages but are unique within a target language (and consequently within the dialects of that target language).

Test Segments

For each conversation side, two subsets of segments were extracted. Each subset contained three segments, one for each of the three durations (3 seconds, 10 seconds, and 30 seconds), with the 3-second segment a subset of the 10-second segment and the 10-second segment a subset of the 30-second segment.
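The nesting of the three durations can be pictured with a small Python sketch. It is only an illustration: the description above does not say how the segments were positioned within one another, so centering each shorter window inside the longer one, measuring in plain clock time, and the function name nested_segments are all assumptions.

```python
def nested_segments(start, durations=(30.0, 10.0, 3.0)):
    """Return {duration: (begin, end)} in seconds.

    Assumption: each shorter window is centered inside the next longer one.
    `durations` must be listed from longest to shortest.
    """
    segments = {}
    begin, end = start, start + durations[0]
    for d in durations:
        center = (begin + end) / 2.0
        begin, end = center - d / 2.0, center + d / 2.0
        segments[d] = (begin, end)
    return segments


# Hypothetical 30-second stretch starting at t = 0 s of a conversation side.
segments = nested_segments(0.0)
for duration, (begin, end) in segments.items():
    print(f"{duration:>4.0f}-second segment: {begin:.1f} s to {end:.1f} s")
```

Whatever the actual placement rule was, the property that matters is containment: every shorter segment lies entirely within the next longer one.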

Figure 2: Evaluation data distribution (bar graph).

Evaluation Rules

Participants were given a set of rules to follow during the evaluation. The rules were created to ensure the quality of the evaluation and can be found in the evaluation plan.

Performance Measurement

The performance of a detection system is characterized by its false alarm and miss error rates. We defined a detection cost function, Cavg, that is a language-weighted and equally weighted sum of the two error types.

Figure 3: Evaluation metric.
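As a rough illustration of how a cost of this kind can be computed, the following Python sketch averages, over the target languages, a weighted sum of the miss rate and the mean false alarm rate against the other languages. The weights (unit costs and a target prior of 0.5), the restriction to the closed-set case, the made-up error rates, and all function and variable names are assumptions made for illustration; the authoritative definition is the one given in the evaluation plan.

```python
def c_avg(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Closed-set average detection cost (illustrative sketch).

    p_miss: dict mapping target language -> miss rate
    p_fa:   dict mapping (target, nontarget) pair -> false alarm rate
    The default costs and prior are assumed values, not taken from this page.
    """
    languages = sorted(p_miss)
    n = len(languages)
    # The prior mass left for nontargets is spread evenly over the other languages.
    p_nontarget = (1.0 - p_target) / (n - 1)
    total = 0.0
    for target in languages:
        miss_cost = c_miss * p_target * p_miss[target]
        fa_cost = sum(
            c_fa * p_nontarget * p_fa[(target, other)]
            for other in languages
            if other != target
        )
        total += miss_cost + fa_cost
    # Each target language contributes equally to the average.
    return total / n


# Made-up error rates for three hypothetical languages, for demonstration only.
p_miss = {"A": 0.10, "B": 0.20, "C": 0.30}
p_fa = {
    ("A", "B"): 0.05, ("A", "C"): 0.15,
    ("B", "A"): 0.10, ("B", "C"): 0.10,
    ("C", "A"): 0.20, ("C", "B"): 0.00,
}
cost = c_avg(p_miss, p_fa)
print(f"Cavg = {cost:.3f}")
```

Averaging per language first, rather than pooling all trials, keeps a language with many test segments from dominating the result.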

Results Representation

In addition to the bar charts used to plot the Cavg obtained, Detection Error Tradeoff (DET) curves, a linearized version of the ROC curve, were used to show all operating points as the likelihood threshold was varied. Two special operating points, (a) the system decision point and (b) the optimal decision point, were plotted on each curve. More information on the DET curve can be found in Martin, A. F. et al., "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
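To make the DET construction concrete, the following Python sketch sweeps a decision threshold over a set of scores, records the miss and false alarm rates at each threshold, maps them to the normal deviate scale used for DET axes, and picks the point that minimizes an equally weighted sum of the two error rates. The scores are made up, and the function names, the ">=" acceptance rule, and the equal weighting used for the "optimal" point are assumptions for illustration, not the evaluation's scoring software.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (threshold, p_miss, p_fa) for each candidate threshold.

    A trial is accepted ("yes") when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points


def probit(p, eps=1e-6):
    """Normal deviate of p, clipped away from 0 and 1 so it stays finite."""
    return NormalDist().inv_cdf(min(max(p, eps), 1.0 - eps))


def optimal_point(points):
    """Operating point minimizing the equally weighted sum of both error rates."""
    return min(points, key=lambda pt: 0.5 * pt[1] + 0.5 * pt[2])


# Made-up scores, for demonstration only.
target_scores = [2.1, 1.5, 0.9, 0.4, -0.2]
nontarget_scores = [0.6, 0.1, -0.5, -1.0, -1.8]

points = det_points(target_scores, nontarget_scores)
best = optimal_point(points)
# DET axes: x = probit(p_fa), y = probit(p_miss).
det_axes = [(probit(p_fa), probit(p_miss)) for _, p_miss, p_fa in points]
print("optimal point (threshold, p_miss, p_fa):", best)
```

The system decision point corresponds to the error rates of the submitted hard decisions; when it sits far from the optimal point on the same curve, the scores separate the classes better than the chosen threshold exploits.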

Participating Organizations

A diverse group of organizations from four continents participated in the evaluation. Table 2 lists some of these organizations in alphabetical order.

Organization Location
Beijing Naphoo Technology Company+ China
Brno University of Technology Czech Republic
Georgia Institute of Technology USA
Groupe des Ecoles des Telecommunication, Ecole Nationale Superieure des Telecommunications France
IBM USA
IKERLAN Technological Research Center Spain
Institut de Recherche en Informatique de Toulouse France
Institute for Infocomm Research Singapore
Institute of Acoustics, Chinese Academy of Sciences+ China
Institut National de Recherche sur les Transports et Leur Securite France
International Computer Science Institute (USA) USA
Laboratoire d'Informatique pour la Mecanique et les Sciences de l'Ingenieur France
MIT Lincoln Laboratory USA
Nanyang Technological University Singapore
Politecnico di Torino Italy
Spescom Datavoice South Africa
Telefonica I & D Spain
TNO Human Factors The Netherlands
Tsinghua University China
Universidad Autónoma de Madrid Spain
University of the Basque Country Spain
University of Stellenbosch South Africa
University of Science and Technology of China+ China

+ indicates the organization did not send a representative to the post-evaluation workshop.

The following organizations dropped out of the evaluation.

Evaluation Results

The results are presented without attribution to the participating organizations to emphasize that the purpose of this evaluation is to support LR research, not to serve as a product testing exercise.

The graphs below show the results for each test and condition. The bar charts show the Cavg values in increasing order, while the DET curves show the tradeoff between the two error types as the decision threshold is varied. Note that the circle on the DET curve indicates the system decision point while the triangle indicates the optimal decision point.

We also conducted a pilot experiment in which we asked a human to perform dialect detection of a language in which he/she is a native speaker on the same data. Although the four dialect tests were assessed (one human for each dialect test), only the results for English, Mandarin, and Spanish are shown (labeled human in the graphs). The human listener for the Hindustani data did not complete the task.

 
[Graphs: for each of the six tests (General LR, Chinese LR, English DR, Mandarin DR, Hindustani DR, Spanish DR), the bar charts and DET curves were arranged in a grid by condition (closed-set, open-set) and segment duration (30-second, 10-second, 3-second).]

Language Pair Confusability

The error matrix below shows the false alarm error rates for each target/segment language pair for one particularly good system that participated in the evaluation. Note that the diagonal boxes contain no information and should be ignored. The non-diagonal boxes show the degree of confusability between each language pair: the darker the box, the more confusable the two languages. The matrix also includes a row (the last row) with the miss error rates.

Figure 4: Error matrix showing the false alarm error rates for each target/segment language pair, for one good system in the open-set general language recognition test with 3-second segments.
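A matrix of this kind can be tabulated directly from the hard decisions. The Python sketch below is a minimal illustration under assumed conventions: each trial is a (target language, segment language, decision) triple, the trial data are made up, and the function name error_matrix is hypothetical.

```python
from collections import defaultdict


def error_matrix(trials):
    """Tabulate per-pair false alarm rates and per-target miss rates.

    trials: iterable of (target_language, segment_language, decision),
            where decision is True when the system answered "yes".
    """
    counts = defaultdict(lambda: [0, 0])  # (target, segment) -> [yes answers, trials]
    for target, segment, decision in trials:
        counts[(target, segment)][0] += int(decision)
        counts[(target, segment)][1] += 1

    false_alarm = {}
    miss = {}
    for (target, segment), (yes, total) in counts.items():
        if target == segment:
            # Target trial: a "no" answer is a miss.
            miss[target] = (total - yes) / total
        else:
            # Nontarget trial: a "yes" answer is a false alarm.
            false_alarm[(target, segment)] = yes / total
    return false_alarm, miss


# Made-up trials, for demonstration only.
trials = [
    ("Hindi", "Hindi", True), ("Hindi", "Hindi", False),
    ("Hindi", "Urdu", True), ("Hindi", "Urdu", True),
    ("Hindi", "Urdu", False), ("Hindi", "Urdu", False),
    ("Urdu", "Urdu", True),
    ("Urdu", "Hindi", False),
]
false_alarm, miss = error_matrix(trials)
print(false_alarm)
print(miss)
```

The off-diagonal entries in false_alarm play the role of the shaded boxes, and miss plays the role of the final row of the matrix.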

Evaluation History

Figure 5 summarizes the evaluation series in terms of the number of languages, dialects, and participants since the first evaluation in 1996.

Figure 5: LRE History 1996-2007 (line graph of the number of languages/dialects and participants).

Figure 6 summarizes the best Cavg obtained in each evaluation for each of the three durations. Note the decrease in performance for the 30-second segments in the 2005 evaluation, when Indian English was introduced.

Figure 6: LRE Performance History 1996-2007 (line graph of best system performance).

Future Plans

The next language evaluation is planned for Spring/Summer 2009. It is hoped that a richer set of languages/dialects will be available.