2000 TREC-9 Spoken Document Retrieval (SDR) Track Retrieval Test Material

With the exception of the test collection audio recordings, which must be obtained from the Linguistic Data Consortium (LDC) on CD-ROM, this data set contains all of the test material for the TREC-9 SDR Track. Please see the TREC-9 SDR Website for specifics regarding the use of this data and pertinent dates. Note that ALL retrieval results are due by 9:00am EDT, August 14.

This data set contains several versions of transcriptions of the audio test collection: human reference transcripts (1), NIST baseline recognizer transcripts (2), and site-contributed recognizer transcripts from Cambridge University, LIMSI, and Sheffield University. It also contains the 50 topics for the test in both "short" and "terse" forms (3). This data set also contains non-lexical information files, automatically extracted from the audio signal, contributed by the University of Cambridge (4).

Please use this test material ONLY in accordance with the rules provided in the SDR Evaluation Specification. PLEASE be sure to name and format your retrieval run submissions as specified in the Evaluation Specification. There are so many conditions this year that we will have to reject your submission if it is not properly formatted and named.

The following files and directories are included in this Data Set.
*: To use these data, you must contact the Linguistic Data Consortium (LDC) and must already have signed a license agreement covering this broadcast news material.
For your convenience, the test condition codes as defined in the Evaluation Specification are given in the tables below, along with the appropriate data to be used in each condition.

Unknown Story Boundaries Retrieval Conditions:
No Non-Lexical Information (Control) Retrieval Conditions:
Known Story Boundaries Retrieval Conditions:
Notes:

1. Note that the human reference transcriptions have been revised this year to better reflect the form of the transcript that a perfect recognizer might provide. First, there are now no gaps in the transcripts. Commercial and non-news segments which were previously untranscribed have been filled in by NIST via ROVER using several sets of ASR transcripts from different systems. While these derived transcripts are not of the quality produced by human annotation, they exceed the accuracy of any particular existing ASR system. In addition, LIMSI has kindly computed start and end times for each word via forced alignment and has normalized the orthography to use standard representations for lexical tokens (e.g., 1 -> "one"). The result is that the human reference transcripts this year will be very similar in coverage and form to the ASR-produced transcripts. Note that because of these changes, NIST will need to update the mapping rules for recognizer scoring to correspond to the normalizations LIMSI implemented. We will notify you when we have created a new .glm file for scoring.

2. Note that this year's set of "B1" recognizer transcripts is the same as last year's set of "B2" recognizer transcripts.

3. The "short" topic forms are similar to last year's SDR topics, and the "terse" forms are the corresponding new keyword-based forms. Note that the IDs of the short and terse forms are identical. However, this information should NOT be used in any way in the test.
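Note 1 says that the previously untranscribed segments were filled in by combining several ASR transcripts with ROVER. As a rough illustration of the voting idea only, the sketch below picks the most frequent word in each position. It assumes the system outputs have already been aligned into word slots of equal length, with an empty string marking "no word". The actual ROVER tool performs that alignment itself and can also use confidence scores. The function name `vote_aligned`, the `NULL` marker, and the sample words are hypothetical and do not come from the test material.

```python
from collections import Counter

NULL = ""  # hypothetical marker for "no word" in an aligned slot


def vote_aligned(hypotheses):
    """Pick the most frequent word in each aligned slot.

    `hypotheses` is a list of equal-length word lists, one per system.
    Ties go to the word seen first, i.e. the earliest-listed system.
    Slots where the NULL marker wins produce no output word.
    """
    if not hypotheses:
        return []
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("hypotheses must be pre-aligned to the same length")
    result = []
    for slot in zip(*hypotheses):
        # most_common(1) returns the single highest-count (word, count) pair.
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            result.append(word)
    return result


# Three made-up system outputs, already aligned slot by slot.
systems = [
    ["the", "news", "at", "nine"],
    ["the", "news", NULL, "nine"],
    ["a", "news", "at", "nine"],
]
combined = vote_aligned(systems)
```

With these inputs, each slot has a two-to-one majority, so a single system's substitution or deletion is outvoted by the other two.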
Page Created: August 23, 2007
Multimodal Information Group is part of IAD and ITL. NIST is an agency of the U.S. Department of Commerce.