NIST Speech Group Website
Information Technology Lab, Information Access Division NIST: National Institute of Standards and Technology


2000 TREC-9 Spoken Document Retrieval (SDR) Track Retrieval Test Material

    With the exception of the test collection audio recordings, which must be obtained from the Linguistic Data Consortium (LDC) on CD-ROM, this data set contains all of the test material for the TREC-9 SDR Track. Please see the TREC-9 SDR Website for specifics regarding the use of this data and pertinent dates. Note that ALL retrieval results are due by 9:00am EDT, August 14.

    This data set contains several versions of transcriptions of the audio test collection: human reference transcripts (1), NIST baseline recognizer transcripts (2), and site-contributed recognizer transcripts from Cambridge University, LIMSI, and Sheffield University. It also contains the 50 topics for the test in both "short" and "terse" forms (3). Finally, it contains non-lexical information files, automatically extracted from the audio signal and contributed by the University of Cambridge (4). Please use this test material ONLY in accordance with the rules provided in the SDR Evaluation Specification.

    PLEASE be sure to name and format your retrieval run submissions as specified in the Evaluation Specification. There are so many conditions this year that we will have to reject your submission if it is not properly formatted and named.

    The following files and directories are included in this data set:

    • ir-test-material.html, this file
    • short_topics.sgml, SGML file containing short version of test topics
    • terse_topics.sgml, SGML file containing terse (keyword) version of test topics
    • SDR2000-ndx-K.tgz, archive file containing Story Boundary Known indexes
    • SDR2000-ndx-U.tgz, archive file containing Story Boundary Unknown indexes
    • SDR2000-ndx-complete.tgz, archive file containing Story Boundary indexes with news and non news story types
    • SDR2000-ref-ltt-K.tgz*, archive file containing Story Boundary Known reference transcriptions without word times
    • SDR2000-ref-srt-K.tgz*, archive file containing Story Boundary Known reference transcriptions with word times
    • SDR2000-ref-srt-U.tgz*, archive file containing Story Boundary Unknown reference transcriptions with word times
    • SDR2000-nist-b1-srt-K.tgz, archive file containing Story Boundary Known baseline recognizer transcriptions
    • SDR2000-nist-b1-srt-U.tgz, archive file containing Story Boundary Unknown baseline recognizer transcriptions
    • Xrecognizer/, directory containing site-contributed ASR transcripts for Cross Recognizer retrieval condition
    • sdt/, directory containing .sdt files with site-contributed automatically extracted non-lexical information

      The following material is to be used in scoring the results of SDR retrieval tests and should not be consulted until ALL test runs are complete:
    • qrels.trec9.sdr.txt, Human-assessed relevance lists for each test topic. Used as ground truth with TREC_EVAL in scoring retrieval accuracy.
    • UIDmatch.pl, Perl script to map system output retrieval times in the story boundary unknown conditions to ground truth story IDs for TREC_EVAL scoring.
      NOTE: You must download the SDR2000-ndx-complete.tgz archive above and edit the script so that its path points to the ndx files you downloaded.
      The line you must edit is clearly marked and underlined inside the script itself.
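The kind of mapping UIDmatch.pl performs, from a retrieval time to a ground truth story ID, can be pictured with a minimal Python sketch. This is not the script's actual logic and it does not read the real ndx file format: the `Story` tuple, the story IDs, and the boundary times below are hypothetical values chosen only for illustration. The sketch assumes the stories of one episode are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Story(NamedTuple):
    """One story segment of an episode (hypothetical structure)."""
    story_id: str
    start: float  # seconds from the start of the episode
    end: float    # seconds; the segment covers [start, end)


def story_at(stories: List[Story], t: float) -> Optional[str]:
    """Return the ID of the story whose [start, end) span contains t.

    Returns None when t falls before the first story or in a gap
    between stories. `stories` must be sorted by start time.
    """
    starts = [s.start for s in stories]
    # Index of the last story that starts at or before t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < stories[i].end:
        return stories[i].story_id
    return None


# Hypothetical boundaries for a single episode, for illustration only.
EPISODE = [
    Story("ep1.story1", 0.0, 95.5),
    Story("ep1.story2", 95.5, 240.0),
    Story("ep1.story3", 300.0, 410.0),
]

if __name__ == "__main__":
    for t in (100.0, 250.0):
        print(t, story_at(EPISODE, t))
```

A time that falls in a gap between stories maps to no story ID in this sketch; how the real script treats such times is not stated here, so consult the script itself.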

    *: To use these data you must contact the Linguistic Data Consortium (LDC) and have previously signed a license agreement to use this broadcast news material.

    For your convenience, the test condition codes as defined in the Evaluation Specification are given in the tables below along with the appropriate data to be used in the particular condition.

    Unknown Story Boundaries Retrieval Conditions:

    Condition Code | Transcript Set                        | Topic Set     | Non-Lexical Information Set
    R1SU           | Reference Transcripts, unknown bounds | short-queries | Non-Lexical Information (Optional)
    R1TU           | Reference Transcripts, unknown bounds | terse-queries | Non-Lexical Information (Optional)
    B1SU           | Baseline Transcripts, unknown bounds  | short-queries | Non-Lexical Information (Optional)
    B1TU           | Baseline Transcripts, unknown bounds  | terse-queries | Non-Lexical Information (Optional)
    S1SU           | Own SU Recognizer 1                   | short-queries | Non-Lexical Information (Optional)
    S1TU           | Own SU Recognizer 1                   | terse-queries | Non-Lexical Information (Optional)
    S2SU           | Own SU Recognizer 2                   | short-queries | Non-Lexical Information (Optional)
    S2TU           | Own SU Recognizer 2                   | terse-queries | Non-Lexical Information (Optional)
    CRSU           | Shared Recognizer Transcripts         | short-queries | Non-Lexical Information (Optional)
    CRTU           | Shared Recognizer Transcripts         | terse-queries | Non-Lexical Information (Optional)

    No Non-Lexical Information (Control) Retrieval Conditions:

    Condition Code | Transcript Set                        | Topic Set     | Non-Lexical Information Set
    R1SUN          | Reference Transcripts, unknown bounds | short-queries | NONE
    R1TUN          | Reference Transcripts, unknown bounds | terse-queries | NONE
    B1SUN          | Baseline Transcripts, unknown bounds  | short-queries | NONE
    B1TUN          | Baseline Transcripts, unknown bounds  | terse-queries | NONE
    S1SUN          | Own SU Recognizer 1                   | short-queries | NONE
    S1TUN          | Own SU Recognizer 1                   | terse-queries | NONE
    S2SUN          | Own SU Recognizer 2                   | short-queries | NONE
    S2TUN          | Own SU Recognizer 2                   | terse-queries | NONE

    Known Story Boundaries Retrieval Conditions:

    Condition Code | Transcript Set                                                   | Topic Set     | Non-Lexical Information Set
    R1SK           | Reference Transcripts (either srt or ltt versions), known bounds | short-queries | NONE
    R1TK           | Reference Transcripts (either srt or ltt versions), known bounds | terse-queries | NONE
    B1SK           | Baseline Transcripts, known bounds                               | short-queries | NONE
    B1TK           | Baseline Transcripts, known bounds                               | terse-queries | NONE
    S1SK           | Own SK Recognizer 1                                              | short-queries | NONE
    S1TK           | Own SK Recognizer 1                                              | terse-queries | NONE
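The condition codes in the tables above appear to follow a positional pattern: two characters for the transcript set, one for the topic form (S or T), one for the story boundary condition (U or K), and an optional trailing N for the control runs without non-lexical information. The Python sketch below decodes a code on that assumption. The pattern is read off the tables rather than quoted from the Evaluation Specification, and the function and dictionary names are illustrative only, so treat the specification as authoritative.

```python
# Codes listed in the three tables above; anything else is rejected.
LISTED_CODES = set(
    "R1SU R1TU B1SU B1TU S1SU S1TU S2SU S2TU CRSU CRTU "
    "R1SUN R1TUN B1SUN B1TUN S1SUN S1TUN S2SUN S2TUN "
    "R1SK R1TK B1SK B1TK S1SK S1TK".split()
)

TRANSCRIPTS = {
    "R1": "Reference Transcripts",
    "B1": "Baseline Transcripts",
    "S1": "Own Recognizer 1",
    "S2": "Own Recognizer 2",
    "CR": "Shared Recognizer Transcripts",
}
TOPICS = {"S": "short-queries", "T": "terse-queries"}
BOUNDS = {"U": "unknown", "K": "known"}


def decode_condition(code: str) -> dict:
    """Split a listed condition code into its parts (assumed layout)."""
    if code not in LISTED_CODES:
        raise ValueError(f"not a listed condition code: {code!r}")
    bounds = BOUNDS[code[3]]
    # Per the tables, non-lexical information is optional only in the
    # unknown-boundary conditions that lack the trailing "N".
    optional = bounds == "unknown" and not code.endswith("N")
    return {
        "transcripts": TRANSCRIPTS[code[:2]],
        "topics": TOPICS[code[2]],
        "boundaries": bounds,
        "non_lexical": "optional" if optional else "none",
    }


if __name__ == "__main__":
    for code in ("R1SU", "B1TUN", "S1TK"):
        print(code, decode_condition(code))
```

A check of this kind can be run over planned run names before submission, since improperly named submissions will be rejected.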

    Notes:

    1. Note that the human reference transcriptions have been revised this year to better reflect the form of the transcript that a perfect recognizer might provide. First, there are now no gaps in the transcripts. Commercial and non-news segments which were previously untranscribed have been filled in by NIST via ROVER using several sets of ASR transcripts from different systems. While these derived transcripts are not of the quality produced by human annotation, they exceed the accuracy of any particular existing ASR system. In addition, LIMSI has kindly computed start and end times for each word via forced alignment and has normalized the orthography to use standard representations for lexical tokens (e.g., 1 -> "one"). The result is that the human reference transcripts this year will be very similar in coverage and form to the ASR-produced transcripts. Note that because of these changes, NIST will need to update the mapping rules for recognizer scoring to correspond to the normalizations LIMSI implemented. We will notify you when we have created a new .glm file for scoring.

    2. Note that this year's set of "B1" recognizer transcripts is the same as last year's set of "B2" recognizer transcripts.

    3. The "short" topic forms are similar to last year's SDR topics and the terse forms are the corresponding new keyword-based forms. Note that the IDs of the short and terse forms are identical. However, this information should NOT be used in any way in the test.

    4. Cambridge University has kindly contributed a first instantiation of a set of non-lexical information (.sdt files) for the audio recordings. This information is freely available for all sites to use in their runs. However, if this information is used, sites are required to run the control conditions listed in the Evaluation Specification without the use of the non-lexical information.

    Page Created: August 23, 2007
    Last Updated: November 4, 2008

    Multimodal Information Group is part of IAD and ITL
    NIST is an agency of the U.S. Department of Commerce