Significant Improvements in Speech Technologies Demonstrated in ITL's Rich Transcription 2003 Evaluation

 

The Information Access Division in ITL hosted the Rich Transcription Spring 2003 (RT-03S) Workshop May 18-20 in Boston, MA, to report the results of recent advanced speech technology evaluations conducted by NIST.  These evaluations were sponsored by the Defense Advanced Research Projects Agency (DARPA) to support the evaluation needs of its Effective, Affordable, Reusable Speech-to-Text (EARS) program.  These periodic evaluations help promote and gauge advances in the state of the art in technologies for recognizing human-human speech, where the goal is to create transcriptions that are both readable by humans and useful for machines.  Together with other sophisticated downstream language processing technologies, these technologies will eventually make it possible for machines to do a much better job of detecting, extracting, summarizing, and translating important information.

 

The current foci in the RT evaluations include: the generation of text for the words spoken in human-human communications (speech-to-text transcription - STT), identification of which speaker said which words (speaker diarization), detection and classification of sentence-like units (SU detection), and identification of disfluencies (disfluency detection).  This minimal component set will enable the automatic generation of transcripts that can be rendered in a closed-caption-like form.  However, the eventual goal is to produce more enriched transcriptions that include topic change/paragraphing, identification of proper names, numbers, acronyms, lists, cross-references, and other attributes normally found in human-generated transcriptions.  The recent evaluation included STT tasks in English, Mandarin, and Arabic for broadcast news and conversational telephone speech.  It also included a speaker diarization task for English broadcast news and conversational telephone speech.

 

The results from the RT-03 Spring evaluation indicated significant improvements in STT technology over only one year earlier for both broadcast news and conversational telephone speech in English.  The Word Error Rate (percentage of incorrectly recognized words) for state-of-the-art STT on broadcast news is now approximately 10%.  The same measure for telephone conversations is near 20%.  The error rates for Mandarin and Arabic are significantly higher than their English counterparts.  The STT participants included AT&T, BBN, Carnegie Mellon University/University of Karlsruhe Interactive Systems Labs, IBM, Panasonic, SRI, and internationally, Cambridge University - England, and Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI) - France.  The results for the speaker diarization tests indicated that performance for that task is quite good, with error rates of less than 10% for the English broadcast news domain.  However, the diarization task required only the generation of arbitrary speaker IDs within a single broadcast.  A more difficult task will be to produce an absolute identification of particular "known" speakers such as political figures and news reporters.  The participants in the diarization tasks included International Computer Science Institute (ICSI), Massachusetts Institute of Technology (MIT) Lincoln Labs, Panasonic, and internationally, Cambridge University - England, Communication Langagière et Interaction Personne-Système (CLIPS) - France, Laboratoire Informatique d’Avignon (LIA) - France, LIMSI - France, and ELISA (a joint effort between CLIPS and LIA).
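
For readers unfamiliar with the Word Error Rate figures quoted above, the measure is conventionally computed by aligning the system output against a reference transcript and dividing the number of substituted, deleted, and inserted words by the number of words in the reference.  The following is only a minimal sketch of that conventional calculation, not the scoring software used in the evaluation: the function name `word_error_rate` is hypothetical, and the sketch assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative assumption: words are separated by whitespace and compared
    exactly; real scoring pipelines normally normalize the text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words processed so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word in a ten-word reference gives a rate of 0.1 (10%).
rate = word_error_rate("a b c d e f g h i j", "a b c d e f g h i x")
```

Because inserted words count as errors, a rate computed this way can exceed 100% when the system output is much longer than the reference.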

 

This series of evaluations provides an important contribution to the direction of research efforts and the calibration of technical capabilities in emerging speech recognition technologies.  Although the RT evaluations are sponsored by DARPA and focus largely on the EARS program, they are open to the greater national and international speech research communities. Details are available at http://www.nist.gov/speech/tests/rt/

 

 

Contact:  John Garofolo, ext. 3193

                  David Pallett, ext. 2935