Significant Improvements in Speech Technologies
Demonstrated in ITL's Rich Transcription 2003 Evaluation
The Information Access Division in ITL hosted the Rich Transcription Spring 2003
(RT-03S) Workshop on May 18-20 in Boston, MA, to report the results of recent
advanced speech technology evaluations conducted by NIST. These evaluations were sponsored by the
Defense Advanced Research Projects Agency (DARPA) to support the evaluation
needs of its Effective, Affordable, Reusable Speech-to-Text (EARS)
program. These periodic evaluations
help promote and gauge advances in the state of the art in technologies for the
recognition of human-human speech, where the goal is to create transcriptions
that are both readable by humans and useful for machines. Together with other
sophisticated downstream language processing technologies, these technologies
will eventually make it possible for machines to do a much better job of
detecting, extracting, summarizing, and translating important information.
The current foci in the RT evaluations include:
the generation of text for the words spoken in human-human communications
(speech-to-text transcription - STT), identification of which speaker said
which words (speaker diarization), detection and classification of
sentence-like units (SU detection), and identification of disfluencies
(disfluency detection). This minimal
component set will enable the automatic generation of transcripts that can be
rendered in a closed-caption-like form.
However, the eventual goal is to produce richer transcriptions that
include topic changes and paragraphing, identification of proper names, numbers,
acronyms, lists, cross-references, and other attributes normally found in
human-generated transcriptions. The
recent evaluation included STT tasks in English, Mandarin, and Arabic
for broadcast news and conversational telephone speech. It also included a speaker diarization task
for English broadcast news and conversational telephone speech.
The results from the RT-03 Spring evaluation
indicated that STT technology has improved significantly in just one year for
both broadcast news and conversational telephone speech in English. The Word Error Rate (percentage of
incorrectly recognized words) for state-of-the-art STT on broadcast news is now
at approximately 10%. The same measure
for telephone conversations is near 20%.
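
As a rough illustration of how a word error rate is usually computed, the sketch below aligns a system transcript against a reference transcript using word-level edit distance, then divides the number of substitutions, deletions, and insertions by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions; official scoring relies on NIST's own tools and text normalization rules, which this sketch does not attempt to reproduce.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, invented for this example only.
    ref = "the results were reported at the workshop in boston today"
    hyp = "the results were reported at a workshop in boston"
    print(f"WER = {word_error_rate(ref, hyp):.0%}")
```

Because insertions count as errors, a figure computed this way can exceed 100% when the system outputs many more words than were actually spoken.
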
The error rates for Mandarin and Arabic are significantly higher than
their English counterparts. The STT
participants included AT&T, BBN, Carnegie Mellon University/University of
Karlsruhe Interactive Systems Labs, IBM, Panasonic, SRI, and, internationally,
Cambridge University - England and Laboratoire d'Informatique pour la Mécanique
et les Sciences de l'Ingénieur (LIMSI) - France. The results for the speaker diarization
tests indicated that performance for that task is quite good, with error
rates of less than 10% for the English broadcast news domain. However, the diarization task required only
the generation of arbitrary speaker IDs within a single broadcast. A more difficult task will be to produce an
absolute identification of particular "known" speakers such as
political figures and news reporters.
The participants in the diarization tasks included the International Computer
Science Institute (ICSI), Massachusetts Institute of Technology (MIT) Lincoln
Labs, Panasonic, and, internationally, Cambridge University - England,
Communication Langagière et Interaction Personne-Système (CLIPS) - France,
Laboratoire Informatique d’Avignon (LIA) - France, LIMSI - France, and ELISA (a
joint effort between CLIPS and LIA).
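
Because the diarization task asks only for arbitrary speaker IDs, scoring has to map the system's labels onto the reference speakers before errors can be counted. The sketch below is a simplified, frame-based illustration of that idea. It assumes two equal-length label sequences and ignores missed and false-alarm speech. The function name `diarization_error`, the brute-force search over mappings, and the sample labels are hypothetical and do not represent the NIST scoring procedure.

```python
from itertools import permutations


def diarization_error(reference, hypothesis):
    """Fraction of frames mislabeled under the best one-to-one speaker mapping."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    ref_ids = sorted(set(reference))
    hyp_ids = sorted(set(hypothesis))
    # Pad the reference IDs with None so that every system ID can be assigned,
    # possibly to no reference speaker at all.
    slots = ref_ids + [None] * len(hyp_ids)
    best_correct = 0
    # Brute force over mappings: fine for a handful of speakers only.
    for assignment in permutations(slots, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, assignment))
        correct = sum(mapping[h] == r for r, h in zip(reference, hypothesis))
        best_correct = max(best_correct, correct)
    return 1 - best_correct / len(reference)


if __name__ == "__main__":
    # Hypothetical per-frame labels, invented for this example only.
    reference = ["anchor"] * 4 + ["reporter"] * 4 + ["anchor"] * 2
    hypothesis = ["spk2"] * 4 + ["spk1"] * 3 + ["spk2"] * 3
    print(f"error = {diarization_error(reference, hypothesis):.0%}")
```

Identifying particular "known" speakers would remove the mapping step: the system's labels would have to match the reference names directly.
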
This series of evaluations provides an important
contribution to the direction of research efforts and the calibration of
technical capabilities in emerging speech recognition technologies. Although the RT evaluations are sponsored by
DARPA and focus largely on the EARS program, they are open to the greater
national and international speech research communities. Details are available
at http://www.nist.gov/speech/tests/rt/
Contact:
John Garofolo, ext. 3193
David Pallett, ext. 2935