Version: 1.1
Updated: 30 June 2000
John Garofolo, Jerome Lard, Cedric Auzanne, Ellen Voorhees
This is the specification for implementation of the TREC-9 Spoken Document Retrieval (SDR) Track. For other associated documentation regarding the TREC-9 SDR Track, see http://www.nist.gov/speech/tests/sdr2000/sdr2000.htm
For information regarding other TREC-9 tracks, see the TREC Website at http://trec.nist.gov
Appendix A: SDR Corpus File Formats
Appendix B: SDR Corpus Filters
The 1999 TREC-8 SDR evaluation succeeded in its goal of performing SDR experiments using a realistically large collection of recorded speech. The TREC-8 collection consisted of 557 hours of broadcast news recordings from February through June of 1998 from two radio and two television sources. In TREC-8, we found that SDR technology did indeed scale for such large collections. We continued experiments in Cross-Recognizer retrieval and found an almost identical recognition/retrieval degradation relationship to what we found in TREC-7.
In TREC-8, we also began to explore the implementation and evaluation of SDR where story boundaries are unknown. We found that we could successfully implement and evaluate SDR performance using a temporal, rather than document-based approach. We found the recognition/retrieval degradation relationship to be identical to that of the story boundaries known condition, albeit with lower overall retrieval scores.
Further details regarding the TREC-6 SDR Track can be obtained from the track specification at http://www.nist.gov/speech/tests/sdr/sdr97/sdr97.txt, the TREC-6 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.
Further details regarding the TREC-7 SDR Track can be obtained from the track specification at http://www.nist.gov/speech/tests/sdr/sdr98/sdr98.htm, the TREC-7 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 28-March 3, 1999.
Further details regarding the TREC-8 SDR Track can be obtained from the track specification at http://www.nist.gov/speech/tests/sdr/sdr99/sdr99.htm, and in the TREC-8 Proceedings published by NIST.
Back to the Table of Contents.
There will be only one Baseline recognizer transcript set in TREC-9. The TREC-9 B1 recognizer transcript set will be the time-adaptive recognizer transcript set designated as "B2" in TREC-8.
Sites may incorporate their own and contributed non-lexical side information into their reference, baseline, speech, and cross-recognizer retrieval runs. Speech processing sites are encouraged to extract and share such information.
Note that sites that run retrieval using such non-lexical information are required to implement contrastive reference, baseline, and speech retrieval runs without the use of non-lexical information.
No particular training collection is specified or provided for this track. All previous TREC SDR training and test materials may be used for training. (A list of potential training material is given on the SDR Website.) In addition, sites may make use of other training material as long as these materials are publicly available and pre-date the test collection.
~557-hour TDT-2 corpus subset (audio and human/asr transcripts), shared non-lexical automatically-extracted data
As in TREC-8, a set of Baseline recognizer transcripts will be provided for retrieval sites who do not have access to recognition technology. Using these baseline recognizer transcripts, sites without recognizers can participate in the "Quasi-SDR" subset of the Track.
Note that all sites (Full SDR and Quasi-SDR) will be required to implement retrieval runs on the baseline recognizer transcripts. This will provide a valuable "control" condition for retrieval.
This year, one baseline recognizer transcript set will be provided by a NIST instantiation of the Rough 'N Ready BYBLOS recognition engine kindly provided by BBN. The recognizer was implemented at NIST using a time-adaptive "rolling" language model. The transcript set produced by this recognizer was identified as "B2" in the 1999 TREC-8 SDR Track. These transcripts will be identified as "B1" for the 2000 TREC-9 SDR Track. The design and implementation of this recognizer run is described in detail in the paper: Auzanne, C., Garofolo, J., Fiscus, J., Fisher, W., Automatic Language Model Adaptation for Spoken Document Retrieval, Proc. RIAO 2000, pp. 132-141, which is included on the SDR-2000 website.
Sites in the speech community without access to retrieval engines may use the NIST ZPRISE retrieval engine.
See http://www-nlpir.nist.gov/works/papers/zp2/zp2.html
The 2000 SDR collection is based on the broadcast news audio portion of the TDT-2 News Corpus which was originally collected by the Linguistic Data Consortium to support the DARPA Topic Detection and Tracking Evaluations. The corpus contains recordings, transcriptions, and associated data for several radio and television news sources broadcast daily between January and June 1998. The 2000 SDR Track will use the February - June subset of the TDT-2 corpus (January is excluded so as not to conflict with Hub-4 recognizers which have been trained on overlapping material from January 1998). The SDR collection consists of approximately 557 hours of recordings and contains 21,754 news stories.
The "documents" in the SDR Track are news stories taken from the Linguistic Data Consortium (LDC) TDT-2 Broadcast News Corpus (February - June 1998 subset), which was also used in the 1998 DARPA TDT-2 evaluation. A story is generally defined as a continuous stretch of news material with the same content or theme (e.g. tornado in the Caribbean, fraud scandal at Megabank), which will have been established by hand segmentation of the news programs (at the LDC).
There are two classifications of "stories" in the TDT-2 Corpus: "NEWS" which are topical and content rich and "MISCELLANEOUS" which are transitional filler or commercials. For the story boundaries known condition, only the "NEWS" stories will be included in the SDR collection. However, for the story boundaries unknown condition, all of the material in a broadcast will be used (including commercials and filler segments). Note that a news story is likely to involve more than one speaker, background music, noise etc.
The news stories comprise approximately 385 of the 557 hours of the corpus. (Note that this information may be used for planning purposes, but not in training story-boundaries-unknown systems.)
The collection has been licensed, recorded and transcribed by the Linguistic Data Consortium (LDC). Since use of the collection is controlled by the LDC, all SDR participants must make arrangements directly with the LDC to obtain access to the collection. See below for contact info.
The test collection for the SDR Track consists of digitized NIST SPHERE-formatted waveform (*.sph) files containing recordings of news broadcasts from various radio and television sources aired between 01-February-1998 and 30-June-1998, together with human-transcribed and recognizer-transcribed textual versions of the recordings.
The test collection contains approximately 900 SPHERE-formatted waveform files of recordings of entire broadcasts. Each waveform filename consists of a basename identifying the broadcast and a .sph extension.
The file name format is as follows: 1998MMDD-<STARTTIME>-<ENDTIME>-<NETWORKNAME>-<SHOWNAME>.sph
E.g., the filename 19980107-0130-0200-CNN-HDL.sph indicates a recording of CNN Headline News taped on January 7, 1998 between 1:30 AM and 2:00 AM.
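The naming convention above can be unpacked mechanically. The following is a minimal Python sketch, not part of the track tooling: the names `Broadcast` and `parse_waveform_name` are hypothetical, and the sketch assumes that either `-` or `_` may separate the fields, since both appear in identifiers in this specification. It also assumes that the start and end times fall on the same calendar day.

```python
from datetime import datetime
from typing import NamedTuple


class Broadcast(NamedTuple):
    """Fields recovered from a broadcast file name (hypothetical record)."""
    start: datetime
    end: datetime
    network: str
    show: str


def parse_waveform_name(filename: str) -> Broadcast:
    """Split a broadcast file name or basename into its five fields."""
    # Drop the .sph extension if present, so basenames are accepted too.
    base = filename[:-4] if filename.endswith(".sph") else filename
    # Assumption: '-' and '_' are interchangeable field separators.
    parts = base.replace("_", "-").split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {filename!r}")
    day, start, end, network, show = parts
    # Assumption: the end time is on the same calendar day as the start time.
    return Broadcast(
        start=datetime.strptime(day + start, "%Y%m%d%H%M"),
        end=datetime.strptime(day + end, "%Y%m%d%H%M"),
        network=network,
        show=show,
    )
```

For the example name above, `parse_waveform_name("19980107-0130-0200-CNN-HDL.sph")` yields a start of 01:30 and an end of 02:00 on 7 January 1998, with network `CNN` and show `HDL`.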
The following auxiliary file types (with the same basename as the waveform file they correspond to) are also provided:
Most of the filetypes in the collection will be provided by NIST through the LDC unless otherwise specified in later email.
Unlike in past years, story boundaries will NOT be known for the primary Reference, Baseline, Speech, and Cross Recognizer retrieval conditions. As such, all systems will be required to implement the Story Boundaries Unknown condition for their primary retrieval runs. However, this year an optional Story Boundaries Known condition will also be supported for the Reference, Baseline, and Speech retrieval conditions. The specifications for the Unknown Story Boundaries and Known Story Boundaries conditions follow.
This condition is being implemented to investigate the retrieval of audio excerpts where story boundaries are unknown. As such, no <Section> tags will be given for use in this condition. Full-SDR participants in this condition must recognize entire broadcasts. The object of this task is for retrieval systems to emit a single time impulse for each relevant story. As such, retrieval systems will emit time-based IDs consisting of the broadcast ID plus a time. In creating the reference transcripts for this condition, previously untranscribed sections of the recordings such as commercials or filler segments will be automatically transcribed using the NIST ROVER software across several of the submitted recognizer transcript sets. The ROVER algorithm combines the output of multiple recognizers to create a more optimal transcript. We therefore expect the error rate for these sections to be slightly higher than that of the closed captions, but this will permit entire broadcasts to be indexed in the reference retrieval condition as well as relevance assessment over the entire broadcasts.
A Time ID consists of 2 fields separated by a colon (:)
(e.g., 19980104_1130_1200_CNN_HDL:13.45)
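As an illustration of the Time ID layout, the following Python sketch splits and rebuilds an ID of the form shown in the example. The function names are hypothetical, and the two-decimal formatting is an assumption based on the example.

```python
from typing import Tuple


def parse_time_id(time_id: str) -> Tuple[str, float]:
    """Split 'EPISODE:SECONDS' into the broadcast ID and a time in seconds."""
    episode, _, seconds = time_id.rpartition(":")
    if not episode:
        raise ValueError(f"no ':' separator found in {time_id!r}")
    return episode, float(seconds)


def format_time_id(episode: str, seconds: float) -> str:
    """Build a Time ID with the time written to hundredths of a second."""
    return f"{episode}:{seconds:.2f}"
```

Splitting on the last colon keeps the sketch robust should a broadcast ID ever contain a colon itself.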
In scoring, the TimeIDs will be mapped to Story IDs and duplicates will be eliminated. (See the Retrieval Scoring section for more details on the processing of this condition.)
For this condition, the temporal boundary of news stories in the collection will be "known". Boundary times are given in the SGML <Section> tags contained in the Index (*.ndx) files as well as in the LTT and SRT transcript filetypes. The <Section> tags specify the document IDs and start and end times for each story within the collection.
Note that sections of the waveform files containing commercials and other "out of bounds" material which have not been transcribed by the LDC will be excluded from retrieval in the Known Story Boundary Condition. The NDX files for this condition will indicate the proper subset of the corpus to be indexed and retrieved.
Note: Recognition systems developed for the Known Story Boundaries condition may use the story boundary information in segmenting the recordings and in skipping non-news segments. However, participants are encouraged to implement recognition in conformance with the rules for the Unknown Story Boundaries condition (recognize entire broadcast files and ignore the story boundaries for the recognition portion of the task) so that these transcripts can be used for both the Known and Unknown Story Boundaries conditions. NIST will supply a script to create a filtered copy of whole-broadcast recognized transcripts to add embedded story boundaries and remove non-news material so that these can be used for the Known Story Boundaries condition.
Note that except for the time boundaries and Story IDs provided in the <Section> tags, NO OTHER INFORMATION provided in the SGML tags may be used or indexed in any way for the test including any classification or topic information which may be present.
This is an online recognition/retrospective retrieval task. As such, two speech recognition modes are permitted - each with system date rules:
See Section 10 for details regarding these modes and acoustic and language model training requirements.
The retrieval system date is July 1, 1998.
No Development Test data is specified or provided for the SDR track although this year's training set may be split into the training/test sets used last year for development test purposes.
The 200 hours of LDC Broadcast News data collected in 1996, 1997 and January 1998 is designated as the suggested training material for the 2000 SDR evaluation. This, however, does not preclude the use of other training materials as long as they conform to the restrictions listed later in this section. There is no designated supplementary textual data for SDR language model training. However, sites are encouraged to explore the development of rolling language models using the NewsWire data provided in the TDT-2 corpus. Sites may choose either a "fixed" or "rolling" language model mode as described below for each of their recognition runs.
"Fixed" language model/vocabulary (FLM) systems: This is the traditional speech recognition evaluation mode in which systems implement fixed (non-time-adaptive) language models for recognition. If sites are implementing this recognition model, for all intents and purposes, the fixed recognition date for this evaluation will be 31 January 1998. Therefore, no acoustic or textual materials broadcast or published after this date may be used in developing either the recognition or retrieval system component. These systems will be referred to as Fixed Language Model (FLM) systems and will be dated 31 January 1998.
"Rolling" language model/vocabulary (RLM) systems: This option is supported to investigate the utility of using automatically-adapted evolving language models/vocabularies for recognition in temporal applications. These systems are permitted to use newswire data (not broadcast transcripts) from previous data days to automatically adapt their language models and vocabularies to implement recognition for the current day. For example, sites are permitted to use newswire material from March 17, 1998 to recognize audio material recorded on March 18, 1998. These systems will be referred to as Rolling Language Model (RLM) systems. The TDT-2 newswire portion of the corpus is available to support this mode. The TDT-2 newswire corpus contains approximately the same number of stories as the audio portion and was collected over the same time period.
Sites are permitted to investigate less frequent adaptation schemes (e.g., weekly, monthly, etc.) so long as the material used for adaptation always predates the current data day by at least one day.
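The date constraint for rolling language models amounts to a strict comparison at day granularity. The sketch below is illustrative only: the function names are hypothetical, and the newswire collection is assumed to be available as (date, text) pairs.

```python
from datetime import date
from typing import Iterable, List, Tuple


def usable_for_adaptation(material_date: date, broadcast_date: date) -> bool:
    """True if material predates the broadcast day by at least one day."""
    return material_date < broadcast_date


def select_adaptation_material(
    newswire: Iterable[Tuple[date, str]], broadcast_date: date
) -> List[str]:
    """Keep only newswire texts dated strictly before the broadcast day."""
    return [
        text
        for doc_date, text in newswire
        if usable_for_adaptation(doc_date, broadcast_date)
    ]
```

With this check, newswire from March 17, 1998 is admissible for audio recorded on March 18, 1998, while newswire from March 18 itself is not. A weekly or monthly scheme would simply call the same selection less often.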
Two recognition segmentation modes are included to support the story-boundaries-known and -unknown retrieval conditions:
For story-boundaries-unknown (SU) systems (required): Systems may not use story boundary timing information and must perform recognition on entire broadcasts as specified in the story boundaries unknown index files. Systems are permitted to attempt to AUTOMATICALLY screen out non-news sections such as commercials, but no manual segmentation may be used. These transcripts may be converted into SK-type transcripts with NIST-supplied software and may, therefore, be used for both story-boundaries-unknown and -known retrieval conditions.
For story-boundaries-known (SK) systems (optional): Systems may make use of story boundary timing information for segmentation purposes. They may also ignore non-news sections. However, this recognition mode is discouraged since the transcripts provided by this mode may not be used in story-boundaries-unknown retrieval conditions. All sites are encouraged to implement recognition of whole broadcasts without story boundaries for the Story Boundaries Unknown condition.
The following general rules apply to training for all recognition modes:
The granularity for adaptation for recognition is 1 day. The time of day that an episode (or an excerpt within an episode) was broadcast can be ignored. During recognition of episodes from the "current" day, only language model training data collected up through the "previous" day may be used. However, material for unsupervised acoustic model adaptation from the current day may be used. This implies that audio material to be recognized from the current day may be processed in any order using any adaptation scheme permitted by the above rules.
Note: "Current" refers to the date the episode to be recognized was broadcast.
Sites are requested to report the training materials and adaptation modes they employed in their site reports and TREC papers.
All acoustic and textual materials used in training must be publicly available at the time of the start of the evaluation.
The SDR track is an automatic ad hoc retrospective retrieval task. This means both that any collection-wide statistics may be used in indexing and that the retrieval system may NOT be tuned using the test topics. Participants may not use statistics generated from the reference transcripts collection in the baseline or recognizer transcript collections. Any auxiliary IR training material or auxiliary data structures such as thesauri that are used must predate the 01-JUL-1998 retrieval date. Likewise, any IR training material which is derived from spoken broadcast sources (transcripts) must predate the test collection (prior to 31-JAN-1998).
All sites are required to implement fully automatic retrieval. Therefore, sites may not perform manual query generation in implementing retrieval for their submitted results.
Participants are, of course, free to perform whatever side experiments they like and report these at TREC as contrasts.
For retrieval training purposes, the 1998 SDR-TREC-7 data is available as a set of 23 topics and relevance judgements and the 1999 SDR-TREC-8 data is available as a set of 49 topics and relevance judgements.
Interested sites are requested to register for the SDR Track as soon as possible. Registration in this case merely indicates your interest and does not imply a commitment to participate in the track. Participants must register via the TREC Call for Participation Website at http://trec.nist.gov/cfp.html
Since this is a TREC track, participants are subject to the TREC conditions for participation, including signing licensing agreements for the data. Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but specific advertising claims based on TREC results are forbidden. The conference held in November is open only to participating groups that submit results and to government sponsors. (Signed-up participants should have received more detailed guidelines.)
All SDR participants will be permitted to attend the TREC-9 conference.
Participants must implement either Full SDR or Quasi-SDR retrieval as defined below. Note that sites may not participate by simply pipelining the baseline recognizer transcripts and baseline retrieval engine. Participants should implement at least one of the two major system components. As in previous SDR tracks, sites with speech recognition expertise and sites with retrieval expertise are encouraged to team up to implement Full SDR.
The 2000 TREC-9 SDR Track has two participation levels and several retrieval conditions as detailed below. Given the large number of conditions this year, sites are permitted to submit only 1 run per condition.
Participation Levels:
The following are the retrieval conditions for the SDR Track. Note that some retrieval conditions are required and others are optional.
Two runs are required for this condition: A run with short topics (S) and a run with terse topics (T).
This condition provides a control condition using a "standard" fixed recognizer. It also provides recognizer data for sites without access to recognition technology who wish to participate in the Quasi-SDR subset of the Track.
Two runs are required for this condition: A run with short topics (S) and a run with terse topics (T).
This condition provides a control condition using human-generated closed caption quality recognition.
Two runs are required for this condition: A run with short topics (S) and a run with terse topics (T).
Two runs are required for this condition: A run with short topics (S) and a run with terse topics (T).
The non-lexical information is to be made available in Segmentation Detection Table (.sdt) files as specified in http://www.nist.gov/speech/sdr2000/doc/sdt_spec.txt
Sites submitting this information for sharing must format their data according to the SDT specification.
Two runs are required for this condition: A run with short topics (S) and a run with terse topics (T).
Two runs are required for this condition: A run with short topics (S) and a run with terse topics (T).
Participants MUST use the SAME retrieval strategy for all conditions (that is, term weighting method, stop word list, use of phrases, retrieval model, etc., must remain constant). Sites implementing S1 and/or S2 using non-word-based recognition (phone, word-spotting, lattice, etc.) should use the closest retrieval strategy possible across conditions.
Sites may not use Word Error Rate or other measures as generated by scoring the recognizer transcripts against the reference transcripts to tune their retrieval algorithms in the S* or B* and CR* retrieval conditions (all conditions where recognized transcripts are used). The reference transcripts may not be used in any form for any retrieval condition except of course for R1.
The TREC-9 SDR Track will have 50 topics (queries) constructed by the NIST assessors. Each topic will be presented in two forms: a "short" form, in which the topic is stated concisely in one or two sentences or phrases, and a "terse" form, in which the topic is given as a two- or three-word "Web query". The topics will be grouped by form, with the 50 "short" forms presented first, followed by the 50 "terse" forms. No knowledge of any relation between the short and terse forms should be used; the two forms should be processed completely independently.
Short Examples:
What countries have been accused of human rights violations?
Find reports of fatal air crashes.
What are the latest developments in gun control in the U.S.?
In particular, what measures are being taken to protect children
from guns?
Terse Examples:
"human rights" violations countries
fatal airline crash
gun control "U.S."
For SDR, the search topics must be processed automatically, without any manual intervention. Note that participants are welcome to run their own manually-assisted contrastive runs and report on these at TREC. However, these will not be scored or reported by NIST.
The topics will be supplied in written form. Given the number of retrieval conditions this year, spoken versions of the queries will not be included as part of the test set. However, participants are welcome to run their own contrastive spoken input tests and report on these at TREC.
Relevance assessments for the SDR Track will be provided by the NIST assessors. As in the Adhoc Task, the top 100-ranked documents for each topic from each system for the Reference Condition (R1U) will be pooled and evaluated for relevance. If time and resources permit, additional documents from other retrieval conditions may be added to the pool as well. Note that this approach is employed to make the assessment task manageable, but may not cover all documents that are relevant to the topics.
Note that the same relevance judgements will be used to score both the short form of the topics and the corresponding terse forms.
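The pooling step described above can be pictured as a per-topic union of the top-ranked documents from each contributing run. The following Python sketch is only an illustration of that idea, not NIST's assessment software; the function name and the nested-dictionary layout of `runs` are assumptions.

```python
from typing import Dict, Hashable, List, Set


def build_pool(
    runs: Dict[str, Dict[Hashable, List[str]]], depth: int = 100
) -> Dict[Hashable, Set[str]]:
    """Union the top `depth` documents per topic over all contributing runs.

    `runs` maps a run tag to a mapping from topic ID to a ranked list of
    document IDs (best first).
    """
    pool: Dict[Hashable, Set[str]] = {}
    for ranking_by_topic in runs.values():
        for topic, ranked in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranked[:depth])
    return pool
```

Because only pooled documents are judged, a relevant document that no contributing run ranks within the pool depth remains unjudged, which is the coverage caveat noted above.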
Since the focus of the SDR Track is on the automatic retrieval of spoken documents, manual indexing of documents, manual construction or modification of search topics, and manual relevance feedback may not be used in implementing retrieval runs for scoring by NIST. All submitted retrieval runs must be fully automatic. Note that fully automatic "blind" feedback and similar techniques are permissible and manually-produced reference data such as dictionaries and thesauri may be employed. Note the training and training date constraints specified in Section 10.
Participants are free to perform internal experiments with manual intervention and report on these at TREC.
In order for NIST to automatically log and process all of the many submissions which we expect to receive for this track, participants MUST ensure that their retrieval and recognition submissions meet the following filename and content specs. Incorrectly formatted files will be rejected by NIST.
For retrieval, each submission must have a filename of the following form: <SITE_ID>-<CONDITION>-<RECOGNIZER_ID>.ret where,
The following are some example retrieval submission filenames:
eth-r1su.ret (ETH retrieval using reference transcripts, short queries, with no boundaries)
cmu-b1tu.ret (CMU retrieval using Baseline 1 recognizer, terse queries, with no boundaries)
shef-b1sk.ret (Sheffield retrieval using Baseline 1 recognizer, short queries, with boundaries)
att-s1su-att1u.ret (AT&T retrieval, short topics, using AT&T 1 recognizer with no boundaries)
ibm-crtu-att1u.ret (IBM retrieval, terse topics, using AT&T 1 recognizer with no boundaries)
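Because incorrectly named files are rejected, a local check before submission may be useful. The Python sketch below infers a pattern from the example filenames only: a site ID, a condition code (assumed here to be one of r1, b1, s1, s2, or cr), a topic form (s or t), a boundary flag (u or k), and an optional recognizer ID. The pattern and the name `parse_ret_name` are assumptions, not the official field definitions.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the example filenames; not an official definition.
_RET_NAME = re.compile(
    r"^(?P<site>[a-z0-9]+)-"
    r"(?P<condition>r1|b1|s1|s2|cr)"
    r"(?P<topics>[st])"
    r"(?P<boundaries>[uk])"
    r"(?:-(?P<recognizer>[a-z0-9]+))?"
    r"\.ret$"
)


def parse_ret_name(filename: str) -> Dict[str, Optional[str]]:
    """Split a retrieval submission filename into its inferred components."""
    match = _RET_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized submission filename: {filename!r}")
    # The recognizer entry is None when the optional third field is absent.
    return match.groupdict()
```

For instance, `parse_ret_name("ibm-crtu-att1u.ret")` reports site `ibm`, condition `cr`, terse topics, unknown boundaries, and recognizer `att1u`.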
As in other TREC tracks, for the story-boundaries-known condition the output of a retrieval run is a ranked list of story (document) IDs as identified in the NDX files and <Section> tags in the R1 and B1 transcripts. These will be submitted to NIST for scoring using the standard TREC submission format (a space-delimited table):
23 Q0 19980104_1130_1200_CNN_HDL.0034 1 4238 ibm-b1sk
23 Q0 19980105_1800_1830_ABC_WNT.0143 2 4223 ibm-b1sk
23 Q0 19980105_1130_1200_CNN_HDL.1120 3 4207 ibm-b1sk
23 Q0 19980515_1630_1700_CNN_HDL.0749 4 4194 ibm-b1sk
23 Q0 19980303_1600_1700_VOA_WRP.0061 5 4189 ibm-b1sk
etc.
Field Content:
The Story IDs are given in the Section (story boundary) tags.
*Note that field 5 MUST be in descending order so that ties may be handled properly. This number (not the rank) will be used to rank the documents prior to scoring. The site-given ranks will be ignored by the 'trec_eval' scoring software.
Participants may submit lists with more than 1000 documents for each topic. However, NIST will truncate the list to 1000 documents.
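A run file in this layout can be produced with a few lines of code. The following Python sketch is one possible approach under stated assumptions: the function name `format_run` is hypothetical, the six columns follow the example rows (topic, the literal `Q0`, story ID, rank, score, run tag), and the list is cut at 1000 entries locally even though longer lists are accepted.

```python
from typing import Iterable, List, Tuple


def format_run(
    topic_id: int,
    scored_docs: Iterable[Tuple[str, float]],
    run_tag: str,
    limit: int = 1000,
) -> List[str]:
    """Return submission lines for one topic, best score first."""
    # Sort on the score (field 5), since the score and not the rank is
    # what the scoring software uses. sorted() is stable, so tied scores
    # keep their input order.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:limit]
    return [
        f"{topic_id} Q0 {doc_id} {rank} {score} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]
```

Calling `format_run(23, [("19980105_1800_1830_ABC_WNT.0143", 4223), ("19980104_1130_1200_CNN_HDL.0034", 4238)], "ibm-b1sk")` reproduces the first two example rows above.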
For the story-boundaries-unknown condition, field 3 will be an episode/time tag of the form <Episode-ID>:<Time-in-Seconds.Hundredths> for the retrieved excerpt:
23 Q0 19980104_1130_1200_CNN_HDL:39.52 1 4238 ibm-crsu-att1u
23 Q0 19980105_1800_1830_ABC_WNT:143.69 2 4223 ibm-crsu-att1u
23 Q0 19980105_1130_1200_CNN_HDL:1120.02 3 4207 ibm-crsu-att1u
23 Q0 19980515_1630_1700_CNN_HDL:749.81 4 4194 ibm-crsu-att1u
23 Q0 19980303_1600_1700_VOA_WRP:61.02 5 4189 ibm-crsu-att1u
etc.
Sites are to submit their retrieval output to NIST for scoring using standard TREC procedures and ftp protocols. See the TREC Website at http://trec.nist.gov for more details.
As in last year's SDR Track, only 1-Best-algorithm recognizer transcripts will be accepted by NIST for scoring and if received in time, will be shared across sites for the Cross-Recognizer retrieval conditions. Sites performing Full-SDR not using a 1-Best recognizer are encouraged to self-evaluate their recognizer in their TREC paper.
Since the concept of recognizer transcript sharing for Cross-Recognizer Retrieval experiments appeared to be broadly accepted last year, NIST will assume that all submitted recognizer transcripts are to be scored and made available to other participants for Cross-Recognizer Retrieval. If you would like to submit your recognizer transcripts for scoring, but do NOT want them shared, you must notify NIST (jerome.lard@nist.gov) of the system/run to exclude from sharing PRIOR to submission.
Submitted 1-Best recognizer transcripts must be formatted as follows:
Each recognizer transcript (one per show) is to have a filename of the following form: <EPISODE>.srt where,
A System Description file must be created for each submitted set of recognizer-produced transcripts which outlines pertinent features of the recognition system used. The file should be named: <RECOGNIZER><RUN>.desc where,
Minimally, the system description MUST identify the language model mode which was employed: "Fixed" or "Rolling". If a rolling language model was used, the update period should be identified.
The format for the System Description is as follows:
System ID: (e.g., NIST1U)
The SRT files and System Description File should be placed in a directory with the following name: <RECOGNIZER><RUN> where,
Submit your SRT files as follows:
A gnu-zipped tar archive of the above directory should then be created (e.g., att1u.tgz) using the -cvzf options in GNU tar. This file can now be submitted to NIST for scoring/sharing via anonymous ftp to jaguar.ncsl.nist.gov using your email address as the password. Once you are logged in, cd to the "incoming/sdr2000" directory. Set your mode to binary and "put" the file. This is a "blind" directory, so you will not be able to "ls" your file. Once you have uploaded the file, send email to jerome.lard@nist.gov to indicate that a file is waiting. He will send you a confirmation after the file is successfully extracted and email again later with your SCLITE scores. To keep things simple and file sizes down, please submit separate runs (s1 and s2) in separate tgz files.
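For sites that prefer to script the packaging step, the sketch below builds the same kind of gzip-compressed tar archive with Python's standard `tarfile` module instead of GNU tar. It is an illustrative alternative rather than the prescribed procedure; the function name is hypothetical, and the ftp upload itself is not covered.

```python
import os
import tarfile


def make_submission_archive(run_dir: str) -> str:
    """Create <run_dir>.tgz containing the run directory and its files."""
    run_dir = os.path.normpath(run_dir)
    run_name = os.path.basename(run_dir)  # e.g. "att1u"
    archive_path = run_dir + ".tgz"
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the directory name inside the archive,
        # dropping any leading local path components.
        archive.add(run_dir, arcname=run_name)
    return archive_path
```

Calling `make_submission_archive("att1u")` in the directory that contains `att1u/` writes `att1u.tgz` next to it, one archive per run.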
The submitted output of a 1-Best recognizer must be in the standard SDR Speech Recognizer Transcription (SRT) format. See Appendix A for an example.
The TREC-9 SDR Track retrieval performance will be scored using the NIST "trec_eval" Precision/Recall scoring software. A "shar" file containing the trec_eval software is available via anonymous ftp from the following URL:
ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar
For TREC-9 SDR, the primary retrieval measure will be Mean Average Precision. Other retrieval measures will include: Precision at standard Document rank cutoff levels, single number Average Precision over all relevant documents, and single number R-Precision, precision after R relevant documents retrieved.
These measures are defined in Appendices to TREC Proceedings, and may also be found on the TREC Website at http://trec.nist.gov.
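For quick local checks, the two single-number measures named above can be computed directly from a ranked list and a set of relevant documents. The Python sketch below follows the usual textbook definitions and is not a substitute for trec_eval, which remains the scorer of record; the function names are hypothetical, and topics with no relevant documents are skipped when averaging, which is an assumption of this sketch.

```python
from typing import Dict, Hashable, Iterable, List, Set


def average_precision(ranked: List[str], relevant: Iterable[str]) -> float:
    """Mean of the precision values at each relevant document's rank,
    divided by the total number of relevant documents."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_set)


def r_precision(ranked: List[str], relevant: Iterable[str]) -> float:
    """Precision after R documents are retrieved, R = number relevant."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    r = len(relevant_set)
    return sum(1 for doc_id in ranked[:r] if doc_id in relevant_set) / r


def mean_average_precision(
    runs: Dict[Hashable, List[str]], qrels: Dict[Hashable, Set[str]]
) -> float:
    """Average of per-topic average precision over topics with judgements."""
    topics = [topic for topic, rel in qrels.items() if rel]
    if not topics:
        return 0.0
    total = sum(average_precision(runs.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)
```

As a worked case, a ranking `["a", "x", "b"]` with relevant set `{"a", "b"}` has precision 1/1 at the first relevant document and 2/3 at the second, so its average precision is (1 + 2/3) / 2 = 5/6.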
For the known-story-boundaries condition, NIST will truncate the submitted list to 1000 documents and score it using trec_eval.
For the story-boundaries-unknown (U) retrieval conditions, NIST will programmatically do the following:
The mapping will simply involve converting the time tag to the story ID of the story that the identified time resides within.
NIST will provide a mapping/filtering tool to implement (1 - 2): UIDmatch.pl - convert time-based retrieval output to doc-based output format for trec_eval scoring. See the IR scoring tools page.
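The mapping from time tags to story IDs can be sketched as an interval lookup followed by duplicate removal. The Python code below is an illustration under stated assumptions, not the UIDmatch.pl tool: the function name and the layout of `boundaries` are hypothetical, story intervals are taken as half-open (start inclusive, end exclusive), and times that fall outside every story are simply dropped, which this section does not specify.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# episode ID -> list of (start_seconds, end_seconds, story_id), sorted by start
Boundaries = Dict[str, List[Tuple[float, float, str]]]


def map_times_to_stories(ranked_time_ids: List[str], boundaries: Boundaries) -> List[str]:
    """Convert ranked 'EPISODE:SECONDS' tags to story IDs, keeping the
    highest-ranked occurrence of each story."""
    seen = set()
    story_ranking = []
    for time_id in ranked_time_ids:
        episode, _, seconds = time_id.rpartition(":")
        t = float(seconds)
        stories = boundaries.get(episode, [])
        starts = [start for start, _, _ in stories]
        # Index of the last story that starts at or before time t.
        i = bisect_right(starts, t) - 1
        if i < 0:
            continue  # before the first story (assumption: dropped)
        _, end, story_id = stories[i]
        if t >= end:
            continue  # in a gap between stories (assumption: dropped)
        if story_id in seen:
            continue  # duplicate of a higher-ranked hit
        seen.add(story_id)
        story_ranking.append(story_id)
    return story_ranking
```

The resulting list is in document-based form and can then be written out in the story-boundaries-known layout for scoring.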
TREC-9 Full SDR participants who use 1-best recognition are encouraged to submit their recognizer transcripts in SRT form for scoring by NIST. NIST will employ its "sclite" Speech Recognition Scoring Software to benchmark the Story Word Error Rate for each submission. These scores will be used to examine the relationship between recognition error rate and retrieval performance. Note that to ensure consistency among all forms of the evaluation collection, all SRTs submitted for the Story Boundaries Known retrieval conditions received will be filtered to remove any speech outside the evaluation per the corresponding NDX files.
A randomly-selected 10-hour subset of the SDR collection will be transcribed in Hub-4 form so that the speech recognition transcripts can be scored. This will provide the primary speech recognition measures for the SDR track.
The NIST SCLITE Scoring software is available via the following URL: http://www.nist.gov/speech/software.htm. This page contains an ftp-able link to SCTK, the NIST Speech Recognition Scoring Toolkit which contains SCLITE. The SCLITE software may be updated to accommodate large test sets. The SDR email list will be notified as updates become available.
NIST will provide the following additional scripts to permit useful transformations of the SDR speech recognizer transcripts:
See the Speech scoring tools page.
Note that two forms of NDX files will be provided: one set for the Story Boundaries Known (SK) condition and another set for the Story Boundaries Unknown (SU) condition. The ctm2srt.pl filter and SK NDX file can be used with a CTM file created for the SU condition to create an SRT file for the SK condition as follows:
Since unverified reference transcripts are used in the SDR Track, the SDR Word Error Rates should not be directly compared to those for Hub-4 evaluations which use carefully checked/corrected annotations and special orthographic mapping files.
| Site registration: | ASAP |
| SPH and NDX available (recognition task begins) | 02 Apr 2000 |
| Recognizer transcripts (SRTs) due for scoring/sharing | 21 Jun 2000 |
| Non-lexical information files (SDT) due for sharing | 21 Jun 2000 |
| NDXs, LTTs, SRTs, topics available (all retrieval tasks begin) | 30 Jun 2000 |
| All search results due at NIST | 14 Aug 2000 9am EDT |
| Relevance judgements released by NIST | 02 Oct 2000 |
| Scored Retrieval Results released by NIST | 02 Oct 2000 |
| Conference workbook papers to NIST | 25 Oct 2000 (estimated) |
| TREC-9 Conference | 13-16 Nov 2000 |
Participants must make arrangements with the Linguistic Data Consortium to obtain use of the TDT-2 recorded audio and transcriptions used in the SDR Track. The recorded audio data is available in Shorten-compressed form on approximately 75 CD-ROMs. The transcription and associated textual data will be made available via ftp or via CD-ROM by special request.
The contact at the LDC for obtaining access to SDR corpora is:
Shannon Sears
Linguistic Data Consortium
3615 Market Street
Suite 200
Philadelphia, PA 19104-2608
Phone: (215) 898-0464
Fax: (215) 573-2175
Email: ldc@unagi.cis.upenn.edu
WWW: http://www.ldc.upenn.edu
Several Licensing and pricing options are available to SDR Track participants for access to the TDT-2 corpus:
Test-Only sites must sign a TREC-9 SDR "evaluation only" license agreement and are required to return the discs and delete all waveform/transcript files from their systems within 30 days after completion of the evaluation or be charged the for-profit subscription fee.
ALL SITES MUST SIGN LICENSE AGREEMENTS AVAILABLE FROM THE LDC TO OBTAIN ACCESS TO THE SDR DATA.
Specifics regarding obtaining the data sets for the SDR Track will be made available later in email.
Participants are asked to give full details in their Workbook/Proceedings papers of the resources used at each stage of processing, as well as details of their SR and IR methods. Participants not using 1-best recognizers for Full-SDR should also provide an appropriate analysis of the performance of the recognition algorithm they used and its effect on retrieval.
Any general questions regarding TREC should be addressed to the TREC project
Manager:
Ellen Voorhees, ellen.voorhees@nist.gov
You can also refer to http://trec.nist.gov.
Any questions regarding the SDR Track should be addressed to the Track
Organizer:
John Garofolo, john.garofolo@nist.gov (Speech)
Email discussion regarding the SDR Track can be addressed
to the Track Participant List:
sdr_list@nist.gov
Questions regarding licensing and obtaining training and test data
should be addressed to the Linguistic Data Consortium:
Shannon Sears
Linguistic Data Consortium
3615 Market Street
Suite 200
Philadelphia, PA 19104-2608
Phone: (215) 898-0464
Fax: (215) 573-2175
Email: ldc@unagi.cis.upenn.edu
WWW: http://www.ldc.upenn.edu
Note: All transcription files are SGML-tagged.
.sph - SPHERE waveform: SPHERE-formatted digitized recording of a broadcast, used as input to speech recognition systems. Waveform format is 16-bit linear PCM, 16 kHz sample rate, MSB/LSB byte order.
NIST_1A
1024
sample_count -i 27444801
sample_rate -i 16000
channel_count -i 1
sample_byte_format -s2 10
sample_n_bytes -i 2
sample_coding -s3 pcm
sample_min -i -27065
sample_max -i 27159
sample_checksum -i 31575
database_id -s7 Hub4_96
broadcast_id NPR_MKP_960913_1830_1900
sample_sig_bits -i 16
end_head
(digitized 16-bit waveform follows header)
.
.
.
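A header like the one above can be read with a few lines of Python. The sketch below assumes the usual SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" fields terminated by "end_head") and treats "-i" as integer, "-r" as real, and anything else as a string. The function names are hypothetical, and lines that do not have three fields are skipped rather than repaired.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def parse_sphere_header(raw: bytes) -> Dict[str, Value]:
    """Parse the ASCII header found at the start of a SPHERE file."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE header")
    fields: Dict[str, Value] = {"header_bytes": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # blank or malformed field line: skip it
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return fields


def duration_seconds(header: Dict[str, Value]) -> float:
    """Recording length implied by sample_count and sample_rate."""
    return int(header["sample_count"]) / int(header["sample_rate"])
```

In practice the bytes would come from something like `open(path, "rb").read(1024)`, where `path` is a placeholder; the second header line gives the actual header size if it differs from 1024.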
.ltt - Lexical TREC Transcription: ASR-style reference transcription with all SGML tags removed except for Episode and Section. "Non-News" Sections are excluded. This format is used as the source for the Reference Retrieval condition.
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
</Section>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
agricultural products giant archer daniels midland is often described as politically well connected any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed
...
</Section>
...
</Episode>
.ndx - Index: Specifies <Sections> in waveform and establishes story boundaries and IDs. Similar to LTT format without text. Non-transcribed Sections are excluded.
For the known story boundaries condition, the ndx format will require one
Section tag per story as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
...
</Episode>
For the unknown story boundaries condition, the ndx format will require a
single "FAKE" Section tag that will encompass the entire Episode as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News"
Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
</Episode>
Note that the start time of the fake section is the start time of the first NEWS story and
the end time is the end time of the last NEWS story of the show.
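The relationship between the two NDX forms can be sketched in Python: read the Section tags of a known-boundaries NDX file and derive the span of the single FAKE section from the first and last NEWS stories. This is an illustration only, not a NIST tool; the regular expression assumes the attribute order shown in the examples above, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Matches Section tags with attributes in the order used by the examples.
SECTION_RE = re.compile(
    r"<Section\s+Type=(\w+)\s+S_time=([\d.]+)\s+E_time=([\d.]+)\s+ID=([^\s>]+)\s*>"
)

Section = Tuple[str, float, float, str]  # (Type, S_time, E_time, ID)


def parse_sections(ndx_text: str) -> List[Section]:
    """Extract every Section tag from NDX (or LTT/SRT) text."""
    return [(kind, float(s), float(e), sid)
            for kind, s, e, sid in SECTION_RE.findall(ndx_text)]


def fake_span(sections: List[Section]) -> Tuple[float, float]:
    """Start of the first NEWS story and end of the last NEWS story."""
    news = [s for s in sections if s[0] == "NEWS"]
    if not news:
        raise ValueError("no NEWS sections found")
    return min(s[1] for s in news), max(s[2] for s in news)


def fake_section_tag(episode: str, sections: List[Section]) -> str:
    """Format the single FAKE Section tag for the unknown-boundaries form."""
    start, end = fake_span(sections)
    return f"<Section Type=FAKE S_time={start:.2f} E_time={end:.2f} ID={episode}>"
```

Applied to only the two stories shown in the known-boundaries example, this yields a span of 75.43 to 207.31; the FAKE example above shows 58.67 to 1829.47 because it covers every NEWS story in the show, including those elided by "...".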
.srt - Speech Recognizer Transcript (contrived example): Output of a speech recognizer for a .sph recorded waveform file, which will be used as input for retrieval. Each file must contain an <Episode> tag and properly interleaved <Section> tags taken from the corresponding .ndx file. Each <Word> tag contains the start time and end time (in seconds with two decimal places) and the recognized word.
For the known story boundaries condition, the Section tags follow the ones specified in the ndx file.
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Word S_time=75.52 E_time=75.87>his</Word>
<Word S_time=75.87 E_time=76.36>friday'S</Word>
<Word S_time=76.36 E_time=76.82>september</Word>
<Word S_time=76.82 E_time=77.47>thirteenth</Word>
...
</Section>
...
</Episode>
For the unknown story boundaries condition, the srt format will require a single "FAKE" Section tag that will encompass the entire Episode as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
...
<Word S_time=75.52 E_time=75.87>his</Word>
<Word S_time=75.87 E_time=76.36>friday'S</Word>
<Word S_time=76.36 E_time=76.82>september</Word>
<Word S_time=76.82 E_time=77.47>thirteenth</Word>
...
</Section>
</Episode>
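A Python sketch of reading the Word tags from an SRT file follows. It assumes the tag layout shown in the examples (S_time before E_time, no other attributes) and is not one of the NIST filters; the function names are hypothetical. It also includes a simple sanity check that no word ends before it starts, which a site may want to run before submitting.

```python
import re
from typing import List, Tuple

WORD_RE = re.compile(
    r"<Word\s+S_time=([\d.]+)\s+E_time=([\d.]+)\s*>([^<]*)</Word>"
)

Word = Tuple[float, float, str]  # (S_time, E_time, recognized word)


def parse_words(srt_text: str) -> List[Word]:
    """Extract (start, end, word) triples in file order."""
    return [(float(s), float(e), w.strip())
            for s, e, w in WORD_RE.findall(srt_text)]


def lexical_text(words: List[Word]) -> str:
    """Word sequence without times, as plain space-separated text."""
    return " ".join(w for _, _, w in words)


def misordered(words: List[Word]) -> List[Word]:
    """Words whose end time precedes their start time."""
    return [w for w in words if w[1] < w[0]]
```

Dropping the times in this way gives the kind of plain word sequence that retrieval systems can index when word times are not needed.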
| srt2ltt.pl | This filter transforms the Speech Recognizer Transcription (SRT) format with word times into the Lexical TREC Transcription (LTT) form. The resulting simplified form of the speech recognizer transcription can be used for retrieval if word times are not desired. |
| srt2ctm.pl | This filter transforms the Speech Recognizer Transcription (SRT) format into the CTM format used by the NIST SCLITE Speech Recognition Scoring Software. |
| ctm2srt.pl | This filter together with the corresponding NDX file transforms the CTM format used by the NIST SCLITE Speech Recognition Scoring Software into the SDR Speech Recognizer Transcription (SRT) format. Material not specified in the NDX time tags is excluded. |
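The exclusion of material outside the NDX time tags can be sketched in Python as below. This is not ctm2srt.pl. It assumes the common CTM line layout "file channel start duration word" and, as a labeled assumption, assigns each word to the section containing its temporal midpoint; the rule the NIST filter actually applies may differ. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Word = Tuple[float, float, str]      # (start, end, word)
Section = Tuple[float, float, str]   # (S_time, E_time, ID)


def parse_ctm_line(line: str) -> Tuple[str, Word]:
    """Split one CTM line into (file, (start, end, word)); assumed layout."""
    fields = line.split()
    name, start, dur, word = fields[0], float(fields[2]), float(fields[3]), fields[4]
    return name, (start, start + dur, word)


def keep_in_sections(words: List[Word],
                     sections: List[Section]) -> Dict[str, List[Word]]:
    """Group words by section; words outside every section are dropped."""
    kept: Dict[str, List[Word]] = {sid: [] for _, _, sid in sections}
    for start, end, word in words:
        mid = (start + end) / 2.0    # assumption: midpoint decides membership
        for s_time, e_time, sid in sections:
            if s_time <= mid < e_time:
                kept[sid].append((start, end, word))
                break
    return kept
```

With the SK sections as input, words recognized in commercials or other excluded material fall outside every section and are discarded, which is the effect described for the SU-to-SK conversion.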