2000 TREC-9 Spoken Document Retrieval (SDR) Track Evaluation Specification.

Version: 1.1

Updated: 30 June 2000

Update History

John Garofolo, Jerome Lard, Cedric Auzanne, Ellen Voorhees

This is the specification for implementation of the TREC-9 Spoken Document Retrieval (SDR) Track. For other associated documentation regarding the TREC-9 SDR Track, see http://www.nist.gov/speech/tests/sdr2000/sdr2000.htm

For information regarding other TREC-9 tracks, see the TREC Website at http://trec.nist.gov

Contents


  1. Background from TREC-8
  2. What's New and Different
  3. TREC-9 SDR Track in a Nutshell
  4. Baseline Speech Recognizer
  5. Baseline Retrieval Engine
  6. Spoken Document Test Collection
    1. Collection Documents
    2. Collection File Types
    3. Story Boundaries Conditions
      1. Unknown Story Boundaries
      2. Known Story Boundaries
  7. SDR System Date
  8. Development Test Data
  9. Speech Recognition Training/Model Generation
  10. Retrieval Training, Indexing, and Query Generation
  11. SDR Participation Conditions and Levels
  12. Evaluation Retrieval Conditions
  13. Topics (Queries)
  14. Relevance Assessments
  15. Retrieval (Indexing and Searching) Constraints
  16. Submission Formats
    1. Retrieval Submission Format
    2. Recognition Submission Format
  17. Scoring
    1. Retrieval Scoring
    2. Speech Recognition Scoring
  18. Schedule
  19. Data Licensing and Costs
  20. Reporting Conventions
  21. Contacts

Appendix A: SDR Corpus File Formats

Appendix B: SDR Corpus Filters



  1. Background from TREC-8

    The 1999 TREC-8 SDR evaluation succeeded in its goal of performing SDR experiments using a realistically large collection of recorded speech. The TREC-8 collection consisted of 557 hours of broadcast news recordings from February through June of 1998 from two radio and two television sources. In TREC-8, we found that SDR technology did indeed scale for such large collections. We continued experiments in Cross-Recognizer retrieval and found an almost identical recognition/retrieval degradation relationship to what we found in TREC-7.

    In TREC-8, we also began to explore the implementation and evaluation of SDR where story boundaries are unknown. We found that we could successfully implement and evaluate SDR performance using a temporal, rather than document-based approach. We found the recognition/retrieval degradation relationship to be identical to that of the story boundaries known condition, albeit with lower overall retrieval scores.

    Further details regarding the TREC-6 SDR Track can be obtained from the track specification at http://www.nist.gov/speech/tests/sdr/sdr97/sdr97.txt, the TREC-6 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.

    Further details regarding the TREC-7 SDR Track can be obtained from the track specification at http://www.nist.gov/speech/tests/sdr/sdr98/sdr98.htm, the TREC-7 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 28-March 3, 1999.

    Further details regarding the TREC-8 SDR Track can be obtained from the track specification at http://www.nist.gov/speech/tests/sdr/sdr99/sdr99.htm, and in the TREC-8 Proceedings published by NIST.


  2. What's New and Different

    Test Collection

    The audio portion of the test collection for TREC-9 will be the same as was used in TREC-8. However, the human Reference transcripts are being revised to cover all of the audio files including commercials and previously-untranscribed filler segments. The previously untranscribed sections will be backfilled using the NIST ROVER algorithm over several of the submitted recognizer transcript sets. Therefore, these segments will contain some transcription errors. However, this will allow full comparability with regard to coverage between the Reference and ASR-transcribed retrieval conditions.

    There will be only one Baseline recognizer transcript set in TREC-9. The TREC-9 B1 recognizer transcript set will be the time-adaptive recognizer transcript set designated as "B2" in TREC-8.

    Unknown Boundaries Primary Condition

    The Unknown Story Boundaries condition will be required for all participants. The Known Story Boundaries condition will be optional this year. This also means that sites implementing Full SDR must perform the recognition portion of the task without knowledge of story boundaries.

    Terse and Short Topics

    In addition to the sentence/phrase form of the topics used in TREC-8, we will also be generating a terse 2- or 3-word form of all of the topics for TREC-9, of the type one might use to access documents in a Web-based search engine. This will permit us to examine whether short topics are more susceptible to retrieval degradation due to recognition errors than longer ones. The terse and short forms will be given separately, so the length of the topics will be "known". However, the two forms of the topics must be processed completely independently. All required runs and all selected optional runs must be implemented on BOTH the short and terse forms.

    Non-Lexical Information Exchange

    Currently, sites have used only the word-based transcripts generated by automatic speech recognizers as the input to their retrieval systems. However, for the Unknown Story Boundaries condition, it could be beneficial to have access to other information which can be automatically extracted from the audio signal such as speaker changes, noise changes, music changes, silence, etc. As such, we are providing a standard format for the exchange of this type of data so that sites can explore the utility of incorporating it into their search system. The non-lexical information is to be made available in Segmentation Detection Table (.sdt) files as specified in http://www.nist.gov/speech/tests/sdr/sdr2000/doc/sdt_spec.htm. Sites submitting this information for sharing must format their data according to the SDT specification.

    Sites may incorporate their own and contributed non-lexical side information into their reference, baseline, speech, and cross-recognizer retrieval runs. Speech processing sites are encouraged to extract and share such information.

    Note that sites that run retrieval using such non-lexical information are required to implement contrastive reference, baseline, and speech retrieval runs without the use of non-lexical information.

    Other Required Tasks

    Since there is no central Ad Hoc task in TREC-9, sites implementing either the Full or Quasi SDR test conditions will be given full TREC participation status. No other (non-SDR) test condition will be required.


  3. TREC-9 SDR Track in a Nutshell

    Training Collection:

    No particular training collection is specified or provided for this track. All previous TREC SDR training and test materials may be used for training. (A list of potential training material is given on the SDR Website.) In addition, sites may make use of other training material as long as these materials are publicly available and pre-date the test collection.

    Test Collection:

    ~557-hour TDT-2 corpus subset (audio and human/ASR transcripts), plus shared, automatically extracted non-lexical data

    Participation:

    Although other experimental conditions can be run and reported at TREC, only the above conditions may be submitted to NIST for scoring.

    Topics:

    Retrieval Conditions:

    Recognition Language Models:


    (Choice of the above LM mode is at the site's discretion)

    Recognition Modes:

    Primary Scoring Metrics:

    Important Dates:


  4. Baseline Speech Recognizer

    As in TREC-8, a set of Baseline recognizer transcripts will be provided for retrieval sites who do not have access to recognition technology. Using these baseline recognizer transcripts, sites without recognizers can participate in the "Quasi-SDR" subset of the Track.

    Note that all sites (Full SDR and Quasi-SDR) will be required to implement retrieval runs on the baseline recognizer transcripts. This will provide a valuable "control" condition for retrieval.

    This year, one baseline recognizer transcript set will be provided by a NIST instantiation of the Rough 'N Ready BYBLOS recognition engine kindly provided by BBN. The recognizer was implemented at NIST using a time-adaptive "rolling" language model. The transcript set produced by this recognizer was identified as "B2" in the 1999 TREC-8 SDR Track. These transcripts will be identified as "B1" for the 2000 TREC-9 SDR Track. The design and implementation of this recognizer run is described in detail in the paper: Auzanne, C., Garofolo, J., Fiscus, J., Fisher, W., Automatic Language Model Adaptation for Spoken Document Retrieval, Proc. RIAO 2000, pp. 132-141, which is included on the SDR-2000 website.


  5. Baseline Retrieval Engine

    Sites in the speech community without access to retrieval engines may use the NIST ZPRISE retrieval engine.

    See http://www-nlpir.nist.gov/works/papers/zp2/zp2.html


  6. Spoken Document Test Collection

    The 2000 SDR collection is based on the broadcast news audio portion of the TDT-2 News Corpus which was originally collected by the Linguistic Data Consortium to support the DARPA Topic Detection and Tracking Evaluations. The corpus contains recordings, transcriptions, and associated data for several radio and television news sources broadcast daily between January and June 1998. The 2000 SDR Track will use the February - June subset of the TDT-2 corpus (January is excluded so as not to conflict with Hub-4 recognizers which have been trained on overlapping material from January 1998). The SDR collection consists of approximately 557 hours of recordings and contains 21,754 news stories.


    1. Collection Documents

      The "documents" in the SDR Track are news stories taken from the Linguistic Data Consortium (LDC) TDT-2 Broadcast News Corpus (February - June 1998 subset), which was also used in the 1998 DARPA TDT-2 evaluation. A story is generally defined as a continuous stretch of news material with the same content or theme (e.g. tornado in the Caribbean, fraud scandal at Megabank), which will have been established by hand segmentation of the news programs (at the LDC).

      There are two classifications of "stories" in the TDT-2 Corpus: "NEWS" which are topical and content rich and "MISCELLANEOUS" which are transitional filler or commercials. For the story boundaries known condition, only the "NEWS" stories will be included in the SDR collection. However, for the story boundaries unknown condition, all of the material in a broadcast will be used (including commercials and filler segments). Note that a news story is likely to involve more than one speaker, background music, noise etc.

      The news stories comprise approximately 385 of the 557 hours of the corpus. (Note that this information may be used for planning purposes, but not in training story-boundaries-unknown systems.)

      The collection has been licensed, recorded and transcribed by the Linguistic Data Consortium (LDC). Since use of the collection is controlled by the LDC, all SDR participants must make arrangements directly with the LDC to obtain access to the collection. See below for contact info.


    2. Collection File Types

      The test collection for the SDR Track consists of digitized NIST SPHERE-formatted waveform (*.sph) files containing recordings of news broadcasts from various radio and television sources aired between 01-February-1998 and 30-June-1998, together with human-transcribed and recognizer-transcribed textual versions of the recordings.

      The test collection contains approximately 900 SPHERE-formatted waveform files of recordings of entire broadcasts. Each waveform filename consists of a basename identifying the broadcast and a .sph extension.

      The file name format is as follows: 1998MMDD-<STARTTIME>-<ENDTIME>-<NETWORKNAME>-<SHOWNAME>.sph

      E.g., the filename 19980107-0130-0200-CNN-HDL.sph indicates a recording of CNN Headline News taped on January 7, 1998 between 1:30 AM and 2:00 AM.
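      As an illustration of the naming convention above, the following is a minimal sketch of a filename parser. The function name and the returned field names are hypothetical and not part of any NIST tool. Because the examples in this specification use both "-" and "_" as field separators, the sketch assumes either may occur.

```python
import re
from datetime import datetime

# Hypothetical helper: pattern for the basename layout described above.
# Both "-" and "_" are accepted as separators (an assumption, since both
# appear in this specification's examples). The ".sph" extension is optional.
BROADCAST_RE = re.compile(
    r"^(?P<date>\d{8})[-_](?P<start>\d{4})[-_](?P<end>\d{4})"
    r"[-_](?P<network>[A-Za-z0-9]+)[-_](?P<show>[A-Za-z0-9]+)(?:\.sph)?$"
)


def parse_broadcast_name(name):
    """Split a waveform filename (or basename) into its five fields."""
    m = BROADCAST_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized broadcast name: {name!r}")
    return {
        "date": datetime.strptime(m.group("date"), "%Y%m%d").date(),
        "start": m.group("start"),      # HHMM, as written in the name
        "end": m.group("end"),          # HHMM, as written in the name
        "network": m.group("network"),
        "show": m.group("show"),
    }


info = parse_broadcast_name("19980107-0130-0200-CNN-HDL.sph")
print(info["date"], info["start"], info["end"], info["network"], info["show"])
```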

      The following auxiliary file types (with the same basename as the waveform file they correspond to) are also provided:

      Most of the filetypes in the collection will be provided by NIST through the LDC unless otherwise specified in later email.


    3. Story Boundaries Conditions

      Unlike in past years, story boundaries will NOT be known for the primary Reference, Baseline, Speech, and Cross Recognizer retrieval conditions. As such, all systems will be required to implement the Story Boundaries Unknown condition for their primary retrieval runs. However, this year an optional Story Boundaries Known condition will also be supported for the Reference, Baseline, and Speech retrieval conditions. The specifications for the Unknown Story Boundaries and Known Story Boundaries conditions follow.


      1. Unknown Story Boundaries Condition (required)

        This condition is being implemented to investigate the retrieval of audio excerpts where story boundaries are unknown. As such, no <Section> tags will be given for use in this condition. Full-SDR participants in this condition must recognize entire broadcasts. The object of this task is for retrieval systems to emit a single time impulse for each relevant story. As such, retrieval systems will emit time-based IDs consisting of the broadcast ID plus a time. In creating the reference transcripts for this condition, previously untranscribed sections of the recordings such as commercials or filler segments will be automatically transcribed using the NIST ROVER software across several of the submitted recognizer transcript sets. The ROVER algorithm combines the output of multiple recognizers to create a more accurate transcript. We therefore expect the error rate for these sections to be slightly higher than the closed captions, but this will permit entire broadcasts to be indexed in the reference retrieval condition as well as relevance assessment over the entire broadcasts.

        A Time ID consists of 2 fields separated by a colon (:):

        • the show ID (for instance 19980104_1130_1200_CNN_HDL)
        • a time stamp to hundredths of a second (for instance 13.45 is 13 seconds and 45/100 of a second)

        (e.g., 19980104_1130_1200_CNN_HDL:13.45)

        In scoring, the TimeIDs will be mapped to Story IDs and duplicates will be eliminated. (See the Retrieval Scoring section for more details on the processing of this condition.)
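        The mapping step described above can be sketched as follows. This is only an illustration, not the NIST scoring software: the boundary table STORIES and its times are invented for the example, and two choices are assumptions made here rather than statements of the scoring rules: time IDs that fall outside every story are dropped, and the highest-ranked occurrence of a story is the one kept.

```python
from bisect import bisect_right

# Hypothetical boundary table: show ID -> list of (start, end, story ID),
# sorted by start time in seconds. The times below are made up.
STORIES = {
    "19980104_1130_1200_CNN_HDL": [
        (0.00, 42.10, "19980104_1130_1200_CNN_HDL.0000"),
        (42.10, 118.75, "19980104_1130_1200_CNN_HDL.0042"),
    ],
}


def parse_time_id(time_id):
    """Split '<show ID>:<seconds.hundredths>' into (show ID, seconds)."""
    show_id, sep, seconds = time_id.rpartition(":")
    if not sep or not show_id:
        raise ValueError(f"malformed Time ID: {time_id!r}")
    return show_id, float(seconds)


def story_for(time_id, stories=STORIES):
    """Return the story ID whose time span contains the Time ID, or None."""
    show_id, t = parse_time_id(time_id)
    segments = stories.get(show_id, [])
    starts = [start for start, _end, _story in segments]
    i = bisect_right(starts, t) - 1
    if i >= 0 and segments[i][0] <= t < segments[i][1]:
        return segments[i][2]
    return None


def map_and_dedupe(ranked_time_ids, stories=STORIES):
    """Map a ranked list of Time IDs to story IDs, keeping first occurrences."""
    seen, result = set(), []
    for time_id in ranked_time_ids:
        story = story_for(time_id, stories)
        if story is None or story in seen:
            continue  # assumption: unmapped and duplicate entries are dropped
        seen.add(story)
        result.append(story)
    return result


print(map_and_dedupe([
    "19980104_1130_1200_CNN_HDL:13.45",
    "19980104_1130_1200_CNN_HDL:20.00",
    "19980104_1130_1200_CNN_HDL:50.00",
]))
```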


      2. Known Story Boundaries Condition (optional)

        For this condition, the temporal boundaries of news stories in the collection will be "known". Boundary times are given in the SGML <Section> tags contained in the Index (*.ndx) files as well as in the LTT and SRT transcript filetypes. The <Section> tags specify the document IDs and start and end times for each story within the collection.

        Note that sections of the waveform files containing commercials and other "out of bounds" material which have not been transcribed by the LDC will be excluded from retrieval in the Known Story Boundaries Condition. The NDX files for this condition will indicate the proper subset of the corpus to be indexed and retrieved.

        Note: Recognition systems developed for the Known Story Boundaries condition may use the story boundary information in segmenting the recordings and in skipping non-news segments. However, participants are encouraged to implement recognition in conformance with the rules for the Unknown Story Boundaries condition (recognize entire broadcast files and ignore the story boundaries for the recognition portion of the task) so that these transcripts can be used for both the Known and Unknown Story Boundaries conditions. NIST will supply a script to create a filtered copy of whole-broadcast recognized transcripts to add embedded story boundaries and remove non-news material so that these can be used for the Known Story Boundaries condition.

        Note that except for the time boundaries and Story IDs provided in the <Section> tags, NO OTHER INFORMATION provided in the SGML tags may be used or indexed in any way for the test including any classification or topic information which may be present.


  7. SDR System Date

    This is an online recognition/retrospective retrieval task. As such, two speech recognition modes are permitted - each with system date rules:

    See Section 9 for details regarding these modes and acoustic and language model training requirements.

    The retrieval system date is July 1, 1998.


  8. Development Test Data

    No Development Test data is specified or provided for the SDR track, although this year's training set may be split into the training/test sets used last year for development test purposes.


  9. Speech Recognition Training/Model Generation

    The 200 hours of LDC Broadcast News data collected in 1996, 1997 and January 1998 is designated as the suggested training material for the 1999 SDR evaluation. This, however, does not preclude the use of other training materials as long as they conform to the restrictions listed later in this section. There is no designated supplementary textual data for SDR language model training. However, sites are encouraged to explore the development of rolling language models using the NewsWire data provided in the TDT-2 corpus. Sites may choose either a "fixed" or "rolling" language model mode as described below for each of their recognition runs.

    "Fixed" language model/vocabulary (FLM) systems: This is the traditional speech recognition evaluation mode in which systems implement fixed (non-time-adaptive) language models for recognition. If sites are implementing this recognition model, for all intents and purposes, the fixed recognition date for this evaluation will be 31 January 1998. Therefore, no acoustic or textual materials broadcast or published after this date may be used in developing either the recognition or retrieval system component. These systems will be referred to as Fixed Language Model (FLM) systems and will be dated 31 January 1998.

    "Rolling" language model/vocabulary (RLM) systems: This option is supported to investigate the utility of using automatically-adapted evolving language models/vocabularies for recognition in temporal applications. These systems are permitted to use newswire data (not broadcast transcripts) from previous data days to automatically adapt their language models and vocabularies to implement recognition for the current day. For example, sites are permitted to use newswire material from March 17, 1998 to recognize audio material recorded on March 18, 1998. These systems will be referred to as Rolling Language Model (RLM) systems. The TDT-2 newswire portion of the corpus is available to support this mode. The TDT-2 newswire corpus contains approximately the same number of stories as the audio portion and was collected over the same time period.

    Sites are permitted to investigate less frequent adaptation schemes (e.g., weekly, monthly, etc.) so long as the material used for adaptation always predates the current data day by at least one day.

    Two recognition segmentation modes are included to support the story-boundaries-known and -unknown retrieval conditions:

    For story-boundaries-unknown (SU) systems (required): Systems may not use story boundary timing information and must perform recognition on entire broadcasts as specified in the story boundaries unknown index files. Systems are permitted to attempt to AUTOMATICALLY screen out non-news sections such as commercials, but no manual segmentation may be used. These transcripts may be converted into SK-type transcripts with NIST-supplied software and may, therefore, be used for both story-boundaries-unknown and -known retrieval conditions.

    For story-boundaries-known (SK) systems (optional): Systems may make use of story boundary timing information for segmentation purposes. They may also ignore non-news sections. However, this recognition mode is discouraged since the transcripts provided by this mode may not be used in story-boundaries-unknown retrieval conditions. All sites are encouraged to implement recognition of whole broadcasts without story boundaries for the Story Boundaries Unknown condition.

    The following general rules apply to training for all recognition modes:

    1. No acoustic or transcription material from radio or television news sources broadcast after 31-JAN-98 other than from the SDR99 test collection may be used for any purpose.
    2. No manual transcriptions of broadcast excerpts appearing in the SDR99 test collection may be used for acoustic or language model training.
    3. All material used for language model training/adaptation must predate (non-inclusive) the broadcast date of the episode to be recognized.
    4. All material used for acoustic model training/adaptation must be contemporaneous with (inclusive) or predate the broadcast date of the episode to be recognized.
    5. Any other acoustic or textual data not excluded above such as newswire texts, Web articles, etc. published prior to the day of the episode to be transcribed may be used for training/adaptation.

    The granularity for adaptation for recognition is 1 day. The time of day that an episode (or an excerpt within an episode) was broadcast can be ignored. During recognition of episodes from the "current" day, only language model training data collected up through the "previous" day may be used. However, material for unsupervised acoustic model adaptation from the current day may be used. This implies that audio material to be recognized from the current day may be processed in any order using any adaptation scheme permitted by the above rules.
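    The date rules above can be expressed as simple comparisons at one-day granularity. The following is a minimal sketch only; the function names are hypothetical, and the checks cover just the date constraints (rules 3 and 4 and the fixed 31 January 1998 date), not the restrictions on material sources.

```python
from datetime import date

# Fixed recognition date for FLM systems, as stated in this section.
FLM_CUTOFF = date(1998, 1, 31)


def lm_material_allowed(material_date, episode_date, rolling=True):
    """Language model text: must strictly predate the episode's broadcast day.

    For a fixed (FLM) system, the material may not be later than the
    31 January 1998 recognition date, whatever the episode date is.
    """
    if rolling:
        return material_date < episode_date
    return material_date <= FLM_CUTOFF


def am_material_allowed(material_date, episode_date):
    """Acoustic adaptation material: same day as the episode, or earlier."""
    return material_date <= episode_date


episode = date(1998, 3, 18)
print(lm_material_allowed(date(1998, 3, 17), episode))  # previous day's newswire
print(lm_material_allowed(date(1998, 3, 18), episode))  # same-day newswire
print(am_material_allowed(date(1998, 3, 18), episode))  # same-day audio
```

    With the March 18 episode above, the three calls print True, False and True, matching the March 17/March 18 example given for rolling language models.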

    Note: "Current" refers to the date on which the episode to be recognized was broadcast.

    Sites are requested to report the training materials and adaptation modes they employed in their site reports and TREC papers.

    All acoustic and textual materials used in training must be publicly available at the time of the start of the evaluation.


  10. Retrieval Training, Indexing, and Query Generation

    The SDR track is an automatic ad hoc retrospective retrieval task. This means both that any collection-wide statistics may be used in indexing and that the retrieval system may NOT be tuned using the test topics. Participants may not use statistics generated from the reference transcripts collection in the baseline or recognizer transcript collections. Any auxiliary IR training material or auxiliary data structures such as thesauri that are used must predate the 01-JUL-1998 retrieval date. Likewise, any IR training material which is derived from spoken broadcast sources (transcripts) must predate the test collection (prior to 31-JAN-1998).

    All sites are required to implement fully automatic retrieval. Therefore, sites may not perform manual query generation in implementing retrieval for their submitted results.

    Participants are, of course, free to perform whatever side experiments they like and report these at TREC as contrasts.

    For retrieval training purposes, the 1998 SDR-TREC-7 data is available as a set of 23 topics and relevance judgements and the 1999 SDR-TREC-8 data is available as a set of 49 topics and relevance judgements.


  11. SDR Participation Conditions and Levels

    Interested sites are requested to register for the SDR Track as soon as possible. Registration in this case merely indicates your interest and does not imply a commitment to participate in the track. Participants must register via the TREC Call for Participation Website at http://trec.nist.gov/cfp.html.

    Since this is a TREC track, participants are subject to the TREC conditions for participation, including signing licensing agreements for the data. Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but specific advertising claims based on TREC results are forbidden. The conference held in November is open only to participating groups that submit results and to government sponsors. (Signed-up participants should have received more detailed guidelines.)

    All SDR participants will be permitted to attend the TREC-9 conference.

    Participants must implement either Full SDR or Quasi-SDR retrieval as defined below. Note that sites may not participate by simply pipelining the baseline recognizer transcripts and baseline retrieval engine. Participants should implement at least one of the two major system components. As in previous SDR tracks, sites with speech recognition expertise and sites with retrieval expertise are encouraged to team up to implement Full SDR.

    The 2000 TREC-9 SDR Track has two participation levels and several retrieval conditions as detailed below. Given the large number of conditions this year, sites are permitted to submit only 1 run per condition.

    Participation Levels:

    1. Full SDR. Required Retrieval Runs: S1SU, S1TU, B1SU, B1TU, R1SU, R1TU (see below)
      Sites choosing to participate in Full SDR must produce a ranked time pointer list for each test topic from the recorded audio waveforms. This participation level requires the implementation of both speech recognition and retrieval. In addition, Full SDR participants must implement the Story Boundaries Unknown Baseline and Reference retrieval conditions. Participants may submit an optional second Full SDR run using an alternate recognizer (see below for requirements). Participants may also submit optional Cross-Recognizer runs and Story-Boundaries-Known runs. Sites are also encouraged to explore the use of automatically-generated non-lexical information extracted from the audio signal to assist segmentation and retrieval. If such information is used, control runs without such information must also be implemented.
    2. Quasi-SDR. Required Retrieval Runs: B1SU, B1TU, R1SU, R1TU (see below)
      Sites without access to speech recognition technology may participate in the "Quasi-SDR" subset of the test by implementing retrieval on provided recognizer-produced transcripts. In addition, Quasi-SDR participants must implement the Reference retrieval condition. Participants may submit optional Cross-Recognizer runs and Story-Boundaries-Known runs. Sites are also encouraged to explore the use of automatically-generated non-lexical information extracted from the audio signal to assist segmentation and retrieval. If such information is used, control runs without such information must also be implemented.
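    The four-character run names used above can be decoded mechanically. The sketch below is an illustration only: the field meanings are inferred from the run names and example filenames given elsewhere in this specification (transcript source, then S or T for short or terse topics, then U or K for unknown or known story boundaries), and the function names are hypothetical.

```python
# Field meanings inferred from the run names used in this specification.
SOURCES = {
    "R1": "reference transcripts",
    "B1": "baseline recognizer transcripts",
    "S1": "site's own recognizer transcripts",
    "S2": "site's alternate recognizer transcripts",
    "CR": "cross-recognizer transcripts",
}
TOPIC_FORMS = {"S": "short", "T": "terse"}
BOUNDARIES = {"U": "unknown", "K": "known"}

# Required runs per participation level, copied from the list above.
REQUIRED_RUNS = {
    "full": {"S1SU", "S1TU", "B1SU", "B1TU", "R1SU", "R1TU"},
    "quasi": {"B1SU", "B1TU", "R1SU", "R1TU"},
}


def decode_condition(code):
    """Return (transcript source, topic form, story boundaries) for a run name."""
    code = code.upper()
    if len(code) != 4:
        raise ValueError(f"expected a 4-character condition code: {code!r}")
    try:
        return SOURCES[code[:2]], TOPIC_FORMS[code[2]], BOUNDARIES[code[3]]
    except KeyError:
        raise ValueError(f"unrecognized condition code: {code!r}") from None


def missing_required(level, submitted):
    """List the required runs for a participation level not yet submitted."""
    done = {code.upper() for code in submitted}
    return sorted(REQUIRED_RUNS[level] - done)


print(decode_condition("b1tu"))
print(missing_required("quasi", ["b1su", "r1su"]))
```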


  12. Evaluation Retrieval Conditions

    The following are the retrieval conditions for the SDR Track. Note that some retrieval conditions are required and others are optional.

    Participants MUST use the SAME retrieval strategy for all conditions (that is, term weighting method, stop word list, use of phrases, retrieval model, etc., must remain constant). Sites implementing S1 and/or S2 using non-word-based recognition (phone, word-spotting, lattice, etc.) should use the closest retrieval strategy possible across conditions.

    Sites may not use Word Error Rate or other measures as generated by scoring the recognizer transcripts against the reference transcripts to tune their retrieval algorithms in the S* or B* and CR* retrieval conditions (all conditions where recognized transcripts are used). The reference transcripts may not be used in any form for any retrieval condition except of course for R1.


  13. Topics (Queries)

    The TREC-9 SDR Track will have 50 topics (queries) constructed by the NIST assessors. Each topic will be presented in two forms: a "short" form in which the topic is presented as one or two concise sentences or phrases, and a "terse" form in which the topic is presented as a two- or three-word "Web query". The topics will be clustered by type, with the 50 "short" forms presented first, followed by the 50 "terse" forms. No knowledge of any relation between the short and terse forms should be used. They should be processed completely independently.

    Short Examples:
    What countries have been accused of human rights violations?

    Find reports of fatal air crashes.

    What are the latest developments in gun control in the U.S.?
    In particular, what measures are being taken to protect children from guns?

    Terse Examples:
    "human rights" violations countries

    fatal airline crash

    gun control "U.S."

    For SDR, the search topics must be processed automatically, without any manual intervention. Note that participants are welcome to run their own manually-assisted contrastive runs and report on these at TREC. However, these will not be scored or reported by NIST.

    The topics will be supplied in written form. Given the number of retrieval conditions this year, spoken versions of the queries will not be included as part of the test set. However, participants are welcome to run their own contrastive spoken input tests and report on these at TREC.


  14. Relevance Assessments

    Relevance assessments for the SDR Track will be provided by the NIST assessors. As in the Ad Hoc Task, the top 100-ranked documents for each topic from each system for the Reference Condition (R1U) will be pooled and evaluated for relevance. If time and resources permit, additional documents from other retrieval conditions may be added to the pool as well. Note that this approach is employed to make the assessment task manageable, but may not cover all documents that are relevant to the topics.
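    The pooling procedure described above can be sketched as a union of the top-ranked documents over all runs. This is an illustration of the general technique only, not the NIST assessment software; the data layout and the function name are assumptions.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents per topic across all runs.

    `runs` maps a run ID to a dict of topic ID -> ranked list of document IDs
    (best first). Returns a dict of topic ID -> set of document IDs to assess.
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool


# Toy example with a pool depth of 2 instead of 100.
runs = {
    "siteA-r1su": {23: ["d1", "d2", "d3"]},
    "siteB-r1su": {23: ["d2", "d4", "d1"]},
}
print(sorted(build_pool(runs, depth=2)[23]))
```

    In the toy example the pool for topic 23 contains d1, d2 and d4. Document d3 is never assessed because no run ranked it within the pool depth, which is the coverage limitation noted above.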

    Note that the same relevance judgements will be used to score both the short form of the topics and the corresponding terse forms.


  15. Retrieval (Indexing and Searching) Constraints

    Since the focus of the SDR Track is on the automatic retrieval of spoken documents, manual indexing of documents, manual construction or modification of search topics, and manual relevance feedback may not be used in implementing retrieval runs for scoring by NIST. All submitted retrieval runs must be fully automatic. Note that fully automatic "blind" feedback and similar techniques are permissible and manually-produced reference data such as dictionaries and thesauri may be employed. Note the training and training date constraints specified in Section 10.

    Participants are free to perform internal experiments with manual intervention and report on these at TREC.


  16. Submission Formats

    In order for NIST to automatically log and process all of the many submissions which we expect to receive for this track, participants MUST ensure that their retrieval and recognition submissions meet the following filename and content specs. Incorrectly formatted files will be rejected by NIST.

    1. Retrieval Submission Format

      For retrieval, each submission must have a filename of the following form: <SITE_ID>-<CONDITION>-<RECOGNIZER_ID>.ret where,

      The following are some example retrieval submission filenames:
      eth-r1su.ret (ETH retrieval using reference transcripts, short queries, with no boundaries)
      cmu-b1tu.ret (CMU retrieval using Baseline 1 recognizer, terse queries, with no boundaries)
      shef-b1sk.ret (Sheffield retrieval using Baseline 1 recognizer, short queries, with boundaries)
      att-s1su-att1u.ret (AT&T retrieval, short topics, using AT&T 1 recognizer with no boundaries)
      ibm-crtu-att1u.ret (IBM retrieval, terse topics, using AT&T 1 recognizer with no boundaries)

      As in other TREC tracks, for the story-boundaries-known condition the output of a retrieval run is a ranked list of story (document) ids as identified in the NDX files and <Section> tags in the R1 and B1 transcripts. These will be submitted to NIST for scoring using the standard TREC submission format (a space-delimited table):
      23 Q0 19980104_1130_1200_CNN_HDL.0034 1 4238 ibm-b1sk
      23 Q0 19980105_1800_1830_ABC_WNT.0143 2 4223 ibm-b1sk
      23 Q0 19980105_1130_1200_CNN_HDL.1120 3 4207 ibm-b1sk
      23 Q0 19980515_1630_1700_CNN_HDL.0749 4 4194 ibm-b1sk
      23 Q0 19980303_1600_1700_VOA_WRP.0061 5 4189 ibm-b1sk
      etc.

      Field Content:

      1. Topic ID
      2. Currently unused (must be "Q0")
      3. Story ID of retrieved document
      4. Document rank
      5. *Retrieval system score (INT or FP) which generated the rank.
      6. Site/Run ID (should be same as file basename)

      The Story IDs are given in the Section (story boundary) tags.

      *Note that field 5 MUST be in descending order so that ties may be handled properly. This number (not the rank) will be used to rank the documents prior to scoring. The site-given ranks will be ignored by the 'trec_eval' scoring software.

      Participants may submit lists with more than 1000 documents for each topic. However, NIST will truncate each list to 1000 documents.
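
      As a rough illustration of the six-field layout above, the following Python sketch writes the lines for one topic. The function name format_run_lines and its arguments are hypothetical and are not part of any NIST tool. It sorts by score so that field 5 is in descending order and keeps at most 1000 entries.

```python
from typing import Iterable, List, Tuple

MAX_DOCS_PER_TOPIC = 1000  # longer lists are truncated by NIST anyway


def format_run_lines(topic_id: int,
                     scored_docs: Iterable[Tuple[str, float]],
                     run_id: str) -> List[str]:
    """Return space-delimited submission lines for one topic.

    scored_docs holds (story_id, score) pairs in any order.
    """
    # Sort by descending score; the sort is stable, so ties keep input order.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    lines = []
    for rank, (story_id, score) in enumerate(ranked[:MAX_DOCS_PER_TOPIC], start=1):
        # Fields: topic ID, "Q0", story ID, rank, score, site/run ID
        lines.append(f"{topic_id} Q0 {story_id} {rank} {score} {run_id}")
    return lines
```

      The rank written in field 4 is derived from the score, which matches the note that the score, not the site-given rank, determines the ordering used for scoring.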

      For the story-boundaries-unknown condition, field 3 will be an episode/time tag of the form: <Episode-ID>:<Time-in-Seconds.Hundredths> for the retrieved excerpt:
      23 Q0 19980104_1130_1200_CNN_HDL:39.52 1 4238 ibm-crsu-att1u
      23 Q0 19980105_1800_1830_ABC_WNT:143.69 2 4223 ibm-crsu-att1u
      23 Q0 19980105_1130_1200_CNN_HDL:1120.02 3 4207 ibm-crsu-att1u
      23 Q0 19980515_1630_1700_CNN_HDL:749.81 4 4194 ibm-crsu-att1u
      23 Q0 19980303_1600_1700_VOA_WRP:61.02 5 4189 ibm-crsu-att1u
      etc.
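
      The episode/time tag can be built and split with a few lines of Python. This is only a sketch: the helper names make_time_tag and parse_time_tag are hypothetical, and the time is simply formatted with two decimal places to match the examples above.

```python
from typing import Tuple


def make_time_tag(episode_id: str, seconds: float) -> str:
    """Build '<Episode-ID>:<Time-in-Seconds.Hundredths>'."""
    return f"{episode_id}:{seconds:.2f}"


def parse_time_tag(tag: str) -> Tuple[str, float]:
    """Split a time tag into (episode_id, seconds)."""
    # Split on the last colon so the episode ID is kept whole.
    episode_id, _, time_text = tag.rpartition(":")
    return episode_id, float(time_text)
```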

      Sites are to submit their retrieval output to NIST for scoring using standard TREC procedures and ftp protocols. See the TREC Website at http://trec.nist.gov for more details.


    3. Recognition Submission Format
    4. As in last year's SDR Track, only 1-Best-algorithm recognizer transcripts will be accepted by NIST for scoring and, if received in time, will be shared across sites for the Cross-Recognizer retrieval conditions. Sites performing Full-SDR without a 1-Best recognizer are encouraged to self-evaluate their recognizer in their TREC paper.

      Since the concept of recognizer transcript sharing for Cross-Recognizer Retrieval experiments appeared to be broadly accepted last year, NIST will assume that all submitted recognizer transcripts are to be scored and made available to other participants for Cross-Recognizer Retrieval. If you would like to submit your recognizer transcripts for scoring, but do NOT want them shared, you must notify NIST (jerome.lard@nist.gov) of the system/run to exclude from sharing PRIOR to submission.

      Submitted 1-Best recognizer transcripts must be formatted as follows: Each recognizer transcript (one per show) is to have a filename of the following form: <EPISODE>.srt where,

      A System Description file must be created for each submitted set of recognizer-produced transcripts which outlines pertinent features of the recognition system used. The file should be named: <RECOGNIZER><RUN>.desc where,

      Minimally, the system description MUST identify the language model mode which was employed: "Fixed" or "Rolling". If a rolling language model was used, the update period should be identified.

      The format for the System Description is as follows:
      System ID: (e.g., NIST1U)

      1. SYSTEM DESCRIPTION:
      2. ACOUSTIC TRAINING:
      3. GRAMMAR TRAINING: (e.g., Fixed or Rolling with N-Day Periodic Update)
      4. RECOGNITION LEXICON DESCRIPTION:
      5. DIFFERENCES FROM S1 (if S2):
      6. REFERENCES:

      The SRT files and System Description File should be placed in a directory with the following name: <RECOGNIZER><RUN> where,

      Submit your SRT files as follows:

      A gnu-zipped tar archive of the above directory should then be created (e.g., att1u.tgz) using the -cvzf options in GNU tar. This file can now be submitted to NIST for scoring/sharing via anonymous ftp to jaguar.ncsl.nist.gov using your email address as the password. Once you are logged in, cd to the "incoming/sdr2000" directory. Set your mode to binary and "put" the file. This is a "blind" directory, so you will not be able to "ls" your file. Once you have uploaded the file, send email to jerome.lard@nist.gov to indicate that a file is waiting. He will send you a confirmation after the file is successfully extracted and email again later with your SCLITE scores. To keep things simple and file sizes down, please submit separate runs (s1 and s2) in separate tgz files.
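
      For sites that prefer to script the packaging step, the sketch below builds an equivalent gzip-compressed tar archive with Python's standard tarfile module instead of the GNU tar command line. The function name package_run is hypothetical, and the example assumes the run directory (e.g., att1u) already holds the SRT files and the system description file.

```python
import tarfile
from pathlib import Path


def package_run(run_dir: str) -> Path:
    """Create <run_dir>.tgz next to run_dir, containing the directory itself."""
    source = Path(run_dir)
    archive = source.with_name(source.name + ".tgz")
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps only the directory name (e.g., att1u/) inside the archive
        tar.add(str(source), arcname=source.name)
    return archive
```

      Calling package_run("att1u") would write att1u.tgz in the same parent directory. The upload itself still follows the ftp procedure described above.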

      The submitted output of a 1-Best recognizer must be in the standard SDR Speech Recognizer Transcription (SRT) format. See Appendix A for an example.


  33. Scoring

    1. Retrieval Scoring
    2. The TREC-9 SDR Track retrieval performance will be scored using the NIST "trec_eval" Precision/Recall scoring software. A "shar" file containing the trec_eval software is available via anonymous ftp from the following URL:
      ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar

      For TREC-9 SDR, the primary retrieval measure will be Mean Average Precision. Other retrieval measures will include: Precision at standard document rank cutoff levels, single-number Average Precision over all relevant documents, and single-number R-Precision (precision after R documents have been retrieved, where R is the number of relevant documents for the topic).

      These measures are defined in Appendices to TREC Proceedings, and may also be found on the TREC Website at http://trec.nist.gov.
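
      To make the measures concrete, here is a simplified Python sketch of Average Precision and R-Precision for a single topic. It is an illustration under the usual textbook definitions, not a substitute for trec_eval, which remains the authoritative implementation. The function names are hypothetical.

```python
from typing import Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def r_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Precision after R documents retrieved, where R = number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant) / r
```

      Mean Average Precision is then the arithmetic mean of average_precision over all topics.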

      For the known-story-boundaries condition, NIST will truncate the submitted list to 1000 documents and score it using trec_eval.

      For the story-boundaries-unknown (U) retrieval conditions, NIST will programmatically do the following:

      1. truncate the list to 1000 documents.

      2. map all time tags to unique story IDs. Note that ALL of the recorded time in the collection will have assigned story IDs, including both legitimate retrievable stories and non-stories such as commercials, filler, etc. If a lower ranked story ID is a duplicate of a higher ranked story ID, then a sequence number will be appended to the duplicate (e.g., .1). All of these duplicates will therefore be scored as non-relevant. This same procedure will be applied to both story and non-story material. Therefore, duplication of "hits" within stories and non-stories will be equally penalized.

      3. score using trec_eval.

      The mapping will simply involve converting the time tag to the story ID of the story that the identified time resides within.

      NIST will provide a mapping/filtering tool, UIDmatch.pl, to implement steps 1 and 2. It converts time-based retrieval output to the doc-based output format for trec_eval scoring. See the IR scoring tools page.
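
      The truncation and mapping steps can be sketched in Python as follows. This is not the UIDmatch.pl tool, only an illustration of the procedure described above. The in-memory story index, the function name map_time_tags, the half-open interval test (start <= t < end), and the handling of a time that matches no story are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical index: episode ID -> list of (start_seconds, end_seconds, story_id)
StoryIndex = Dict[str, List[Tuple[float, float, str]]]


def map_time_tags(time_tags: List[str], index: StoryIndex,
                  limit: int = 1000) -> List[str]:
    """Map ranked '<Episode-ID>:<seconds>' tags to story IDs, marking duplicates."""
    seen: Dict[str, int] = {}
    mapped: List[str] = []
    for tag in time_tags[:limit]:  # step 1: truncate the ranked list
        episode_id, _, time_text = tag.rpartition(":")
        t = float(time_text)
        # step 2: find the story whose time span contains t
        story_id = next(
            (sid for start, end, sid in index.get(episode_id, [])
             if start <= t < end),
            tag,  # assumption: keep the raw tag if nothing matches
        )
        count = seen.get(story_id, 0)
        seen[story_id] = count + 1
        # Lower-ranked duplicates get a sequence suffix, so they cannot
        # match a judged story ID and are scored as non-relevant.
        mapped.append(story_id if count == 0 else f"{story_id}.{count}")
    return mapped
```

      The input list must already be in rank order (best first), since the first occurrence of a story keeps its plain ID.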


    3. Speech Recognition Scoring
    4. TREC-9 Full SDR participants who use 1-best recognition are encouraged to submit their recognizer transcripts in SRT form for scoring by NIST. NIST will employ its "sclite" Speech Recognition Scoring Software to benchmark the Story Word Error Rate for each submission. These scores will be used to examine the relationship between recognition error rate and retrieval performance. Note that to ensure consistency among all forms of the evaluation collection, all SRTs received for the Story Boundaries Known retrieval conditions will be filtered to remove any speech outside the evaluation, per the corresponding NDX files.

      A randomly-selected 10-hour subset of the SDR collection will be transcribed in Hub-4 form so that the speech recognition transcripts can be scored. This will provide the primary speech recognition measures for the SDR track.

      The NIST SCLITE Scoring software is available via the following URL: http://www.nist.gov/speech/software.htm. This page contains an ftp-able link to SCTK, the NIST Speech Recognition Scoring Toolkit, which contains SCLITE. The SCLITE software may be updated to accommodate large test sets. The SDR email list will be notified as updates become available.

      NIST will provide the following additional scripts to permit useful transformations of the SDR speech recognizer transcripts:

      See the Speech scoring tools page.

      Note that two forms of NDX files will be provided: 1 set for the Story Boundaries Known (SK) condition and another set for the Story Boundaries Unknown (SU) Condition. The ctm2srt.pl filter and SK NDX file can be used with a CTM file created for the SU condition to create an SRT file for the SK condition as follows:

    NOTE

    Since unverified reference transcripts are used in the SDR Track, the SDR Word Error Rates should not be directly compared to those for Hub-4 evaluations which use carefully checked/corrected annotations and special orthographic mapping files.


  34. Schedule

  35. Site registration: ASAP
    SPH and NDX available (recognition task begins): 02 Apr 2000
    Recognizer transcripts (SRTs) due for scoring/sharing: 21 Jun 2000
    Non-lexical information files (SDT) due for sharing: 21 Jun 2000
    NDXs, LTTs, SRTs, topics available (all retrieval tasks begin): 30 Jun 2000
    All search results due at NIST: 14 Aug 2000, 9am EDT
    Relevance judgements released by NIST: 02 Oct 2000
    Scored retrieval results released by NIST: 02 Oct 2000
    Conference workbook papers to NIST: 25 Oct 2000 (estimated)
    TREC-9 Conference: 13-16 Nov 2000


  36. Data Licensing and Costs

  37. Participants must make arrangements with the Linguistic Data Consortium to obtain use of the TDT-2 recorded audio and transcriptions used in the SDR Track. The recorded audio data is available in Shorten-compressed form on approximately 75 CD-ROMs. The transcription and associated textual data will be made available via ftp or via CD-ROM by special request.

    The contact at the LDC for obtaining access to SDR corpora is:
    Shannon Sears
    Linguistic Data Consortium
    3615 Market Street
    Suite 200
    Philadelphia, PA 19104-2608
    Phone: (215) 898-0464
    Fax: (215) 573-2175
    Email: ldc@unagi.cis.upenn.edu
    WWW: http://www.ldc.upenn.edu

    Several licensing and pricing options are available to SDR Track participants for access to the TDT-2 corpus:

    Test-Only sites must sign a TREC-9 SDR "evaluation only" license agreement and are required to return the discs and delete all waveform/transcript files from their systems within 30 days after completion of the evaluation or be charged the for-profit subscription fee.

    ALL SITES MUST SIGN LICENSE AGREEMENTS AVAILABLE FROM THE LDC TO OBTAIN ACCESS TO THE SDR DATA.

    Specifics regarding obtaining the data sets for the SDR Track will be made available later in email.


  38. Reporting Conventions

  39. Participants are asked to give full details in their Workbook/Proceedings papers of the resources used at each stage of processing, as well as details of their SR and IR methods. Participants not using 1-best recognizers for Full-SDR should also provide an appropriate analysis of the performance of the recognition algorithm they used and its effect on retrieval.


  40. Contacts

  41. Any general questions regarding TREC should be addressed to the TREC Project Manager:
    Ellen Voorhees, ellen.voorhees@nist.gov
    You can also refer to http://trec.nist.gov.

    Any questions regarding the SDR Track should be addressed to the Track Organizer:
    John Garofolo, john.garofolo@nist.gov (Speech)

    Email discussion regarding the SDR Track can be addressed to the Track Participant List:
    sdr_list@nist.gov

    Questions regarding licensing and obtaining training and test data should be addressed to the Linguistic Data Consortium:
    Shannon Sears
    Linguistic Data Consortium
    3615 Market Street
    Suite 200
    Philadelphia, PA 19104-2608
    Phone: (215) 898-0464
    Fax: (215) 573-2175
    Email: ldc@unagi.cis.upenn.edu
    WWW: http://www.ldc.upenn.edu




    APPENDIX A: SDR Corpus File Formats



    Note: All transcription files are SGML-tagged.


    .sph - SPHERE waveform: SPHERE-formatted digitized recording of a broadcast, used as input to speech recognition systems. Waveform format is 16-bit linear PCM, 16 kHz sample rate, MSB/LSB byte order.

    NIST_1A
    1024
    sample_count -i 27444801
    sample_rate -i 16000
    channel_count -i 1
    sample_byte_format -s2 10
    sample_n_bytes -i 2
    sample_coding -s3 pcm
    sample_min -i -27065
    sample_max -i 27159
    sample_checksum -i 31575
    database_id -s7 Hub4_96
    broadcast_id -s24 NPR_MKP_960913_1830_1900
    sample_sig_bits -i 16
    end_head
    (digitized 16-bit waveform follows header)
    .
    .
    .
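
    The header shown above is plain ASCII and can be read with a short parser. The Python sketch below is illustrative only: the function name parse_sphere_header is hypothetical, and it assumes the common SPHERE layout of a "NIST_1A" line, a header-size line, "name -type value" fields (-i integer, -r real, -sN string), and a closing "end_head" line.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(raw: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header found at the start of a SPHERE file's bytes."""
    # Everything before "end_head" is header text; the rest is waveform data.
    head = raw.split(b"end_head", 1)[0]
    lines = head.decode("ascii", errors="replace").splitlines()
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST_1A SPHERE header")
    fields: Dict[str, HeaderValue] = {"header_bytes": int(lines[1])}
    for line in lines[2:]:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == "-i":
            fields[parts[0]] = int(parts[2])
        elif len(parts) == 3 and parts[1] == "-r":
            fields[parts[0]] = float(parts[2])
        elif len(parts) == 3 and parts[1].startswith("-s"):
            fields[parts[0]] = parts[2]
        elif len(parts) == 2:
            # Tolerate a field written without a type code: keep it as a string.
            fields[parts[0]] = parts[1]
    return fields
```

    For the header above, dividing sample_count by sample_rate (27444801 / 16000) gives a recording length of about 1715.3 seconds.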


    .ltt - Lexical TREC Transcription: ASR-style reference transcription with all SGML tags removed except for Episode and Section. "Non-News" Sections are excluded. This format is used as the source for the Reference Retrieval condition.

    <Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
    <Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
    it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
    </Section>
    <Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
    agricultural products giant archer daniels midland is often described as politically well connected any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed
    ...
    </Section>
    ...
    </Episode>


    .ndx - Index: Specifies <Sections> in waveform and establishes story boundaries and IDs. Similar to LTT format without text. Non-transcribed Sections are excluded.

    For the known story boundaries condition, the ndx format will require one Section tag per story as follows:

    <Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
    <Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
    <Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
    ...
    </Episode>

    For the unknown story boundaries condition, the ndx format will require a single "FAKE" Section tag that will encompass the entire Episode as follows:

    <Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
    <Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
    </Episode>

    Note that the start time of the fake section is the start time of the first NEWS story and the end time is the end time of the last NEWS story of the show.
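
    The relationship between the two NDX forms can be sketched in Python: parse the Section tags of a known-boundaries NDX file and derive the single FAKE section from the first and last NEWS stories. The names Section, parse_sections, and fake_section are hypothetical, and the regular expression assumes the attribute order shown in the examples above.

```python
import re
from typing import List, NamedTuple


class Section(NamedTuple):
    kind: str
    start: float
    end: float
    story_id: str


# Assumes attributes appear in the order Type, S_time, E_time, ID.
SECTION_RE = re.compile(
    r"<Section\s+Type=(\S+)\s+S_time=([\d.]+)\s+E_time=([\d.]+)\s+ID=([^\s>]+)>"
)


def parse_sections(ndx_text: str) -> List[Section]:
    """Return every Section tag found in the NDX text, in file order."""
    return [Section(kind, float(start), float(end), story_id)
            for kind, start, end, story_id in SECTION_RE.findall(ndx_text)]


def fake_section(sections: List[Section]) -> Section:
    """Build the single FAKE section spanning the first to the last NEWS story."""
    news = [s for s in sections if s.kind == "NEWS"]
    # The episode ID is the story ID without its trailing ".NNNN" suffix.
    episode_id = news[0].story_id.rsplit(".", 1)[0]
    return Section("FAKE",
                   min(s.start for s in news),
                   max(s.end for s in news),
                   episode_id)
```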


    .srt - Speech Recognizer Transcription (contrived example): Output of the speech recognizer for a .sph recorded waveform file, which will be used as input for retrieval. Each file must contain an <Episode> tag and properly interleaved <Section> tags taken from the corresponding .ndx file. Each <Word> tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word.

    For the known story boundaries condition, the Section tags follow the ones specified in the ndx file.

    <Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
    <Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
    <Word S_time=75.52 E_time=75.87>his</Word>
    <Word S_time=75.87 E_time=76.36>friday'S</Word>
    <Word S_time=76.36 E_time=76.82>september</Word>
    <Word S_time=76.82 E_time=77.47>thirteenth</Word>
    ...
    </Section>
    ...
    </Episode>

    For the unknown story boundaries condition, the srt format will require a single "FAKE" Section tag that will encompass the entire Episode as follows:

    <Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
    <Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
    ...
    <Word S_time=75.52 E_time=75.87>his</Word>
    <Word S_time=75.87 E_time=76.36>friday'S</Word>
    <Word S_time=76.36 E_time=76.82>september</Word>
    <Word S_time=76.82 E_time=77.47>thirteenth</Word>
    ...
    </Section>
    </Episode>
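
    As an illustration of how the <Word> tags can be consumed, the Python sketch below extracts the timed words from SRT text and joins them into plain text for indexing. It is written in the spirit of the srt2ltt.pl filter described in Appendix B but is not a reimplementation of it. The function names are hypothetical, and the regular expression assumes the attribute order shown in the examples above.

```python
import re
from typing import List, Tuple

# Assumes each word is written as <Word S_time=... E_time=...>word</Word>.
WORD_RE = re.compile(r"<Word\s+S_time=([\d.]+)\s+E_time=([\d.]+)>(.*?)</Word>")


def parse_words(srt_text: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, word) for every Word tag, in order."""
    return [(float(start), float(end), word)
            for start, end, word in WORD_RE.findall(srt_text)]


def words_to_text(srt_text: str) -> str:
    """Drop the timing information, leaving the recognized words as plain text."""
    return " ".join(word for _, _, word in parse_words(srt_text))
```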




    APPENDIX B: SDR Corpus Filters



    srt2ltt.pl This filter transforms the Speech Recognizer Transcription (SRT) format with word times into the Lexical TREC Transcription (LTT) form. The resulting simplified form of the speech recognizer transcription can be used for retrieval if word times are not desired.
    srt2ctm.pl This filter transforms the Speech Recognizer Transcription (SRT) format into the CTM format used by the NIST SCLITE Speech Recognition Scoring Software.
    ctm2srt.pl This filter together with the corresponding NDX file transforms the CTM format used by the NIST SCLITE Speech Recognition Scoring Software into the SDR Speech Recognizer Transcription (SRT) format. Material not specified in the NDX time tags is excluded.
