Questions and Answers concerning the Speech Recognition Task.

Updated on May 1st, 2000.

Q: I am about to take part in the Spoken Document Retrieval (SDR) Task. Where should I begin?

A: You should first read the Evaluation Specification. This document provides all the information you need to perform speech recognition within the framework defined by NIST.

If you still have questions, please consult this FAQ for supplementary information.


Q: What are the file formats used for the SDR2000 Speech Recognition Task?

A: There are five file formats that you should be aware of:

For more information regarding the stm format, see the SCLITE documentation.


Q: What are the files to be recognized and where can I find them?

A: The corpus for this evaluation is the February through June subset of the TDT-2 corpus, as specified in Section 6 of the Evaluation Specification. The TDT-2 speech waveform files are available on CD-ROM from the Linguistic Data Consortium (LDC order number: LDC99S82).

The specifications for recognition training and implementation for the SDR track are given in Section 9 of the Evaluation Specification. Please read this document carefully!

The files to be recognized are provided in the ndx format for each of the two permitted recognition modes:


Q: How can I score my speech recognition results?

A: First, you need several software utilities and data files. Before scoring, you must properly format and filter your recognizer transcripts. We provide conversion scripts that convert files between several formats and perform index-based exclusionary filtering. Only excerpts for which there are reference transcripts can be scored (e.g., commercials and other untranscribed excerpts cannot be scored), so the filtering scripts and procedures must first be applied to exclude non-transcribed regions. Once your recognizer transcript is properly formatted and filtered, you can perform scoring.


Required software and files

To perform scoring, you need to download and install the following software packages (version numbers are current as of May 26, 1999) and conversion scripts:

You will also need the following scripts:

The transcript filtering rules file sdr2000.glm is also needed.

To generate srt files from ctm files, you will need the corresponding ndx file for each show to limit scoring to excerpts for which there is a reference transcript.

Ground Truth Data




Converting files

The following procedures, which use the conversion scripts above, may be used to convert files from one format to another. To generate srt files, first copy all the index files (choose the known or unknown boundary index files accordingly) into the directory where your ctm files reside.
Then run the following command:
find ./ -name "*.ctm" -exec ctm2srt.pl {} \; | tee ctm2srt-conv.txt
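
Before running the conversion, it can help to confirm that every ctm file has an index file beside it. The following is a minimal sketch, not part of the distributed scripts: the function name missing_ndx is hypothetical, and it assumes that each index file shares its base name with the ctm file (for example show.ctm and show.ndx), which this FAQ does not state explicitly.

```shell
# Hypothetical helper: list the index files that appear to be missing.
# Assumption: show.ctm is paired with show.ndx in the same directory.
missing_ndx() {
    dir=${1:-.}          # directory holding the ctm files (default: current)
    rc=0
    for ctm in "$dir"/*.ctm; do
        # An unmatched glob stays literal; skip it when no ctm file exists.
        [ -e "$ctm" ] || continue
        ndx="${ctm%.ctm}.ndx"
        if [ ! -f "$ndx" ]; then
            echo "$ndx"  # report the expected but absent index file
            rc=1
        fi
    done
    return "$rc"
}
```

Calling missing_ndx . from the directory holding the ctm files prints nothing and returns 0 when every expected index file is present; otherwise it prints one missing path per line and returns 1.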


Filtering

The purpose of filtering is to eliminate excerpts of shows for which no reference transcript is available. This process also verifies that each submitted word is within an SGML-tagged Section (and deletes out-of-bounds words).
The easiest way to perform this exclusionary filtering is to convert your ctm files to srt and then back to ctm using the provided scripts.

Here are the steps to do this:

  1. Create a directory named "filtered-ctm" and copy all your hypothesis ctm files into it with the following commands:

    mkdir filtered-ctm
    cp *.ctm filtered-ctm/

  2. Copy all the known boundary index files (*.ndx) to the newly created directory as follows:

    cp ./ndx/*.ndx ./filtered-ctm/

  3. Perform a ctm to srt conversion, which eliminates out-of-bounds words, by applying the following command:

    find ./ -name "*.ctm" -exec ctm2srt.pl {} \; | tee ctm2srt-conv.txt

  4. Perform a conversion back to the ctm format. The out-of-bounds words are now eliminated:

    find ./ -name "*.srt" -exec srt2ctm.pl {} \; | tee srt2ctm-conv.txt

At each step, check the log files to be sure no warning or error appeared.
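
The log check can be scripted. The sketch below is only an illustration: the function name check_conv_logs is hypothetical, and the search pattern assumes that the conversion scripts report problems using the words "warning" or "error", which should be confirmed against the actual log output.

```shell
# Hypothetical helper: fail if any conversion log is missing or
# contains a line mentioning a warning or an error.
# Assumption: the scripts use the words "warning" or "error" in their output.
check_conv_logs() {
    status=0
    for log in "$@"; do
        if [ ! -f "$log" ]; then
            echo "log file not found: $log" >&2
            status=1
        elif grep -i -n -E 'warning|error' "$log" >&2; then
            # grep printed the offending lines (with line numbers) to stderr
            echo "problems reported in: $log" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the steps above, the call would be check_conv_logs ctm2srt-conv.txt srt2ctm-conv.txt, run in the directory where the tee commands wrote their logs.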


Scoring

Warning: Large files require a lot of memory for scoring, so be careful when launching a scoring task on more than 10 hours of ASR transcripts. For instance, scoring the 1998 SDR test set (100 hours of data) requires 270 MB of RAM.

Scoring should be performed only on filtered files. Apply filtering as described above to all the ctm files to be scored.

Create one large concatenated ctm file for scoring.
Example: cat *.ctm | sort +0 -1 +1 -2 +2nb -3 > ../Result.ctm

The sort command ensures that the resulting ctm file is sorted according to ctm requirements.

Score the recognizer-produced ctm file against the reference transcript. The scoring process may be launched using the Perl script sdr-score.pl. Given a reference transcript file in stm format and a recognizer-produced hypothesis transcript file in ctm format, it performs all the necessary checks, including software version verification, some orthographic normalization, and the scoring.
Example: sdr-score.pl -r sdr98-reference.stm ../Result.ctm
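
The key syntax used in the concatenation example (+0 -1 +1 -2 +2nb -3) is the historical form, which recent sort implementations may reject. A hedged sketch of the same concatenation step using POSIX -k keys follows. The function name concat_ctm and the LC_ALL=C setting are assumptions made here for illustration; they are not part of the NIST tools.

```shell
# Hypothetical helper: concatenate per-show ctm files into one file,
# sorted on the first three whitespace-separated fields
# (in ctm files: file ID, channel, begin time).
# "-k3,3n" stands in for the historical "+2nb -3" key; numeric
# comparison already skips leading blanks.
concat_ctm() {
    in_dir=$1    # directory holding the filtered *.ctm files
    out_file=$2  # output file; keep it outside in_dir so it is not re-read
    # Assumption: byte-order collation (LC_ALL=C) for reproducible ordering.
    cat "$in_dir"/*.ctm | LC_ALL=C sort -k1,1 -k2,2 -k3,3n > "$out_file"
}
```

For example, concat_ctm filtered-ctm Result.ctm writes the combined file next to the filtered-ctm directory. The begin time must be compared numerically: a plain text comparison would place 12.00 before 3.00.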


Q: I have read all the online documentation but I still have not found the answers to my questions. Who can help me?

A:
Any general questions regarding TREC should be addressed to the TREC Project Manager:
Ellen Voorhees, ellen.voorhees@nist.gov
You can also refer to http://trec.nist.gov.

Any questions regarding the SDR Track should be addressed to the Track Organizer:
John Garofolo, john.garofolo@nist.gov (Speech)

Email discussion regarding the SDR Track can be addressed to the Track Participant List:
sdrlist@jaguar.ncsl.nist.gov

Questions regarding licensing and obtaining training and test data should be addressed to the Linguistic Data Consortium:
Shannon Sears
Linguistic Data Consortium
3615 Market Street
Suite 200
Philadelphia, PA 19104-2608
Phone: (215) 898-0464
Fax: (215) 573-2175
Email: ldc@unagi.cis.upenn.edu
WWW: http://www.ldc.upenn.edu
