NIST is pleased to announce Rich Transcription Evaluation 2002 (RT-02), the first of an anticipated series of annual evaluations for advanced STT (Speech-to-Text) technology. RT-02 supersedes the previously scheduled Hub-5 Conversational Telephone Speech Evaluation.
The evaluation will occur in early April; the evaluation workshop, in early May.
RT-02 will encompass English language audio test data from three sources: (1) broadcast news, (2) telephone conversations, and (3) meetings. The tasks include automatic transcription and automatic metadata extraction described below.
Participants are required to perform automatic transcription of telephone conversations; they are encouraged to perform automatic transcription of broadcast news and meetings. In addition, participants are strongly encouraged to perform automatic metadata extraction from any or all of the three sources. This year, the desired metadata will consist of information about speakers.
This is a step in the direction of "rich transcription", a concept to be further defined and pursued in the DARPA EARS Program. (Note: RT-02 participants need not have any association with EARS.)
Quoting from the DARPA BAA:
The goal of the EARS program is to produce powerful new speech-to-text (automatic transcription) technology whose outputs are substantially richer and much more accurate than currently possible. The program will focus on natural, unconstrained human-human speech from broadcasts and telephone conversations in a number of languages. The intent is to create core enabling technology suitable for a wide range of advanced applications, not to develop those applications. Inputs and outputs will be in the same language.
NIST's Rich Transcription Evaluation (RT-02) is intended to provide opportunities for the research community to demonstrate the strengths (and define the limits) of current speech-to-text technologies. Defining what the rich transcription concept consists of will be an important first step.
To stimulate the discussion, NIST ran a metadata annotation experiment in January 2002. The source data for this experiment appears on the metadata annotation experiment data sample page. Researchers were encouraged to work with the sample source data (audio files and associated transcriptions) to create a set of metadata annotation definitions and sample annotations that they believed were of interest for the RT metadata task. Several research sites submitted a variety of suggested types.
Using these suggestions, NIST created a putative set of initial metadata annotation types (speaker change/ID, acronym, verbal edit interval, named entity/type, numeric expression/type, and temporal expression/type) that we believed could be implemented this year. (We had also conducted an earlier internal experiment, which demonstrated that annotating sentences or punctuation in spontaneous interactive speech was a very difficult task requiring further study, so we chose to defer exploration of those types.)
However, given the very tight schedule, we decided to focus only on the detection of speaker changes and clustering within excerpt for this first evaluation. This would permit us to develop and implement an infrastructure for metadata annotation evaluation while continuing to study and discuss other metadata types of interest for implementation in future evaluations.
The complete RT-02 Evaluation Plan, v3, 4/19/2002 [ps, pdf] is available on this website. Test participants should refer to this document for information regarding the corpora, software, protocols, and rules to be used in implementing this test.
A brief synopsis of the evaluation is provided below.
The RT-02 Scoring Software Page contains pointers to the software packages and ancillary files to be used in evaluating RT-02 systems, along with instructions for using them.
Meeting Room Training Data Transcripts
RT-02 Test Data
The test data from each of these sources will include:
The broadcast news test data, approximately 60 minutes in length, will be similar to that used in previous broadcast news evaluations (Hub-4).
The telephone data, approximately 300 minutes in length, will be similar to that used in the 2001 conversational speech evaluation (Hub-5). There will be three test sets drawn from (1) unreleased original Switchboard, (2) Switchboard II phase 3, and (3) Switchboard cellular phase 2. Each test set will contain five minutes from each of twenty conversations, for a total of approximately one hour forty minutes per test set. Note that the speaker change/speaker ID information provided for Hub-5 is not to be used for the speaker metadata annotation task.
The meeting data, approximately 80 minutes in length, will consist of excerpts from recent meetings held in the CMU, ICSI, LDC, and NIST Meeting Data Collection Laboratories. Some of the test data will come from a tabletop microphone. Gain-adjusted data from individual head-mounted or lapel microphones worn by each subject will be available for contrastive tests. (Although meeting data is not included in EARS, it is included in RT-02 to provide a useful reference point for possible future programs.)
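The approximate durations quoted above can be cross-checked with a short tally. This is only an illustrative sketch: the constant names are hypothetical, and the figures are the nominal (approximate) values stated in the text, not measured durations of the released data.

```python
# Nominal test-set durations in minutes, taken from the descriptions above.
# All names here are illustrative, not part of any NIST software.
BROADCAST_NEWS_MIN = 60
MEETING_MIN = 80

TELEPHONE_SETS = 3              # original Switchboard, Switchboard II phase 3, Switchboard cellular phase 2
CONVERSATIONS_PER_SET = 20
MINUTES_PER_CONVERSATION = 5

# Each telephone test set: 20 conversations x 5 minutes = 100 minutes (1 h 40 min).
telephone_set_min = CONVERSATIONS_PER_SET * MINUTES_PER_CONVERSATION
# All telephone data: 3 sets x 100 minutes = 300 minutes.
telephone_min = TELEPHONE_SETS * telephone_set_min
# Total across the three sources.
total_min = BROADCAST_NEWS_MIN + telephone_min + MEETING_MIN

print(f"telephone set: {telephone_set_min} min, telephone total: {telephone_min} min")
print(f"all sources: {total_min} min")
```

The per-set figure (100 minutes) matches the "one hour forty minutes" quoted for each telephone test set, and three such sets give the stated 300 minutes.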
RT-02 Training Data
As in previous evaluations, any language corpora/resources that are publicly available at the start of the evaluation may be used for training or development testing purposes. Note that there is no specifically designated development test data for this evaluation. However, the following is suggested training material for this evaluation.
For broadcast news, all existing Hub-4 training data may be used.
Conversational Telephone Speech
Allowable training data consists of the entire Switchboard (i.e., Switchboard I) Corpus as released, the entire Switchboard II phases 1 & 2 Corpus, and all English conversations of the Call_Home Corpus, including those originally designated for training and those used as test data in previous evaluations.
For the meeting data, given the short time frame and the fact that this is a new domain, only a small data set comparable to the evaluation test set will be made available for training.
RT-02 Evaluation Tasks
In addition to producing word transcripts, evaluated by the standard metric of word error rate using the NIST SCLITE scoring software, RT-02 participants will be asked to produce associated metadata consisting of speaker change detection and clustering within excerpt. This metadata will be evaluated in a manner similar to the 2001 Speaker Recognition Evaluation for N-Speaker Segmentation (Section 188.8.131.52) using the Speaker Segmentation Scoring Software.
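For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation, assuming plain whitespace tokenization and exact string matching. It is not the SCLITE implementation: the function name `word_error_rate` is hypothetical, and SCLITE's own text normalization and alignment rules are not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)  # cost 0 on a match
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Two reference words ("on", "the") are missing from the hypothesis:
# 2 errors over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat mat"))
```

Because insertions count as errors but do not add to the denominator, the value can exceed 1.0 when the hypothesis is much longer than the reference.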
Participants may develop a separate system (acoustic and/or language models) for each of the three data types, or a single multi-purpose system.
Sites are encouraged to form teams encompassing different types of expertise and experience for participation in RT-02. Participants need not have any association with the EARS program.
View the current RT-02 schedule.
Page Created: September 28, 2007