1998 Topic Detection and Tracking Evaluation (TDT-2)
This page contains information and links to files for the 1998 TDT technology evaluation project. Note that it will be updated periodically as new materials and information become available. Members of the TDT email list will be notified of updates.
BACKGROUND
The purpose of the 1998 Topic Detection and Tracking (TDT-2) project is to advance the state of the art in technologies required to segment, detect, and track topical information in an information stream. The general TDT task domain is to be explored and technology is to be developed in the context of an evaluation-driven R&D paradigm, in which key technical challenges are defined and supported by formal evaluations. Three key technical challenges will be explored in TDT-2: Topic Segmentation, Topic Detection, and Topic Tracking.
The TDT-2 project addresses multiple sources of information in the form of both text and speech from newswire and radio and television news broadcast programs. The information flowing from each source is modeled as a sequence of stories (and non-stories). These stories may provide information on one or more topics. The technical challenge is to identify and to follow the topics being discussed in these stories.
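The stream model described above can be pictured with a small sketch. This is an illustration only, not the TDT-2 corpus format or any NIST software: the `Segment` class, its field names, the `stories_by_topic` helper, and the sample source and topic labels are all hypothetical. It shows a stream as a sequence of story and non-story segments, where a story may carry more than one topic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Segment:
    """One unit in a source's stream (hypothetical structure, not the TDT-2 file format)."""
    source: str                 # name of the newswire or broadcast source
    text: str                   # story text or transcript
    is_story: bool = True       # False for non-story material
    topics: Set[str] = field(default_factory=set)  # zero or more topic labels


def stories_by_topic(stream: List[Segment]) -> Dict[str, List[Segment]]:
    """Group the stories in a stream by topic label, skipping non-stories."""
    index: Dict[str, List[Segment]] = {}
    for seg in stream:
        if not seg.is_story:
            continue
        # A story discussing several topics is listed under each of them.
        for topic in sorted(seg.topics):
            index.setdefault(topic, []).append(seg)
    return index


# Made-up sample stream: two stories and one non-story segment.
stream = [
    Segment("newswire", "Story A", topics={"topic-1"}),
    Segment("broadcast", "Station identification", is_story=False),
    Segment("broadcast", "Story B", topics={"topic-1", "topic-2"}),
]
index = stories_by_topic(stream)
```

In this sketch stream order is preserved within each topic, which matches the idea of following a topic as later stories arrive.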
This overview of the TDT-2 program was provided by Charles Wayne.
INSTRUCTIONS AND DOCUMENTATION
The 1998 TDT-2 Evaluation Specification Version 3.7 (in PDF or PostScript) is the core document for the TDT Evaluation and contains detailed information regarding participation, implementation, and schedule. If you intend to participate in the TDT Evaluation, read this document before using the TDT data or building algorithms for the evaluation.
A TDT Pilot Study (TDT-1) was conducted in 1997 and involved Topic Detection, Tracking, and Segmentation using a smaller news corpus. Information regarding the TDT-1 Corpus can be found on the LDC TDT Website.
CORPORA
When complete, the TDT-2 Corpus will contain approximately 60,000 news stories from AP WorldStream, NY Times News Service, CNN Headline News, ABC World News Tonight, Voice of America World News, and Public Radio International The World (transcripts are provided for the audio sources). A set of 100 target topics will be identified for the corpus, and the corpus will be divided into training, development test, and evaluation test subsets of approximately equal size. See the LDC Website or contact the LDC to obtain the TDT-2 training material. The Development Test material will be made available this summer, and the Evaluation Test material will be released in the fall, prior to the TDT-2 Evaluation in December.
The TDT-1 Pilot Corpus contains 15,863 news stories from Reuters North American and CNN Broadcast Transcripts. A set of 25 target events has been identified for the corpus. See the LDC Website or contact the LDC to obtain the TDT-1 Pilot Corpus.
Documentation regarding the TDT-1 and TDT-2 corpora is available on the LDC TDT Website.
SOFTWARE
The NIST TDT-2 Scoring Software can be used to score your Detection, Tracking, and Segmentation runs. (Note that it requires the prior installation of PERL5.) After decompressing and tar-extracting the archive, see the file "readme.txt" for installation and usage details. Note that the scoring software is currently being revised to add functionality requested at the May TDT meeting. A new version will be released later in June and made available here.
Sites that wish to experiment with the TDT-1 newswire corpus can use the UMASS TDT-1 scoring software. Note that this software is made available as-is and is NOT supported by NIST.
DATA LICENSING
The TDT-1 and TDT-2 corpora are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information. See the Evaluation Specification for details regarding the use of the TDT-2 corpus in the official evaluation.
CONTACT INFORMATION
If you are interested in participating in TDT, would like to be added to the TDT email list, or have questions about the evaluation protocols and software, contact speech_webmaster[at]nist.gov.
Questions regarding the TDT corpora and how to obtain access to them should be directed to firstname.lastname@example.org.
Page Created: August 21, 2007