The Information Access Division in ITL has recently hosted a series of workshops reporting the results of NIST speech and speaker recognition evaluations. These annual evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in emerging speech and speaker recognition technology.
The Rich Transcription 2002 (RT-02) Workshop was held for the first time in May 2002, in Fairfax, Virginia. It reported the results of evaluations of advanced speech-to-text technology. These evaluations encompassed English-language audio test data from three sources: broadcast news, telephone conversations, and meetings; the tasks included automatic transcription and automatic metadata extraction. Participants included four industrial (including SRI), three academic, and one governmental (LIMSI, from France) organizations. Metadata in this initial RT evaluation was limited to the clustering of speech intervals according to speaker identity, a task similar to the speaker segmentation task in the NIST speaker recognition evaluations. This was the first time recognition of meeting room data was evaluated. The test data included parts of meetings arranged and recorded recently at NIST, LDC, CMU, and ICSI. This was the second NIST evaluation that included speech from cellular telephone conversations. The best word error rate achieved on such conversations was significantly lower than in the previous evaluation. There was also a large decrease in the best word error rates achieved on broadcast news compared to previous evaluations. Details are available at http://www.nist.gov/speech/tests/rt/rt2002/.
The 2002 Speaker Recognition Workshop was held in Virginia in May. Four industrial, four governmental (two U.S. and two foreign), and 17 academic organizations participated. The number of participating sites more than doubled to 25, and workshop attendance was the highest ever. The countries represented included China, India, Israel, Lebanon, South Africa, France, Switzerland, Greece, Spain, Sweden, and the U.S. The evaluation covered several basic tasks in text-independent speaker recognition, including speaker detection when a single speaker is present, speaker detection when multiple speakers are present, and speaker segmentation when multiple unknown speakers are present. The evaluations were conducted using Conversational Telephone Speech data provided by the Linguistic Data Consortium (LDC). This was the first time that the main test data came from cellular telephone speech, though some cellular speech was included in the 2001 evaluation. The best systems showed improved recognition performance on cellular data compared with the 2001 results. The multi-modal speaker detection condition used forensic-type data collected by the FBI, taken from the FBI Voice Database. Speakers contributed data over a telephone line and through either of two types of microphones, which allowed performance to be contrasted when different types of data were used for training and testing. More details are available at http://www.nist.gov/speech/tests/spk/2002/index.htm.
The kickoff meeting for the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) Program was coordinated by NIST and held May 9-10 in Virginia, immediately following the RT-02 Workshop. Six research teams were competitively chosen to participate in this five-year program, which aims to develop systems that run in real time with greatly reduced speech recognition error rates. The teams were led by researchers from ICSI, Microsoft, BBN, Cambridge University, IBM, and SRI, but a number of leading researchers from other institutions were also included. NIST has been designated to coordinate the essential program infrastructure, and the research teams will participate in future NIST Rich Transcription Evaluations.
The first of two Automatic Content Extraction (ACE) evaluations in 2002 was held in February. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In addition to the four participants from previous evaluations (BBN, MITRE, NYU, and SRI), this evaluation included three first-time participants: the University of Sheffield, Baldwin Language Technologies, and ClearForest. The evaluation consisted of the standard Entity Detection Task (EDT) with an optional cross-document EDT task. Cross-document EDT was implemented as a database-filling task in which the system had prior knowledge of several hundred entities and had to detect all occurrences of each known entity in the evaluation data.
The next workshop is scheduled for September 2002.
Contact: John Garofolo, ext. 3193
Alvin Martin, ext. 3169
Jonathan Fiscus, ext. 3182