NIST Speech Group Website
Information Technology Lab, Information Access Division NIST: National Institute of Standards and Technology


    Topic Detection and Tracking Resources

    A number of resources are available for researchers interested in the TDT field of research. They include the TDT corpora (TDT2, TDT3, and TDT4), evaluation infrastructure resources, Mandarin Chinese resources, and technical information from the TDT 1999, TDT 2000, TDT 2001, TDT 2003, and TDT 2004 workshops.

    TDT Corpora

    Corpora and Mandarin language resources for system development and evaluation are provided by the Linguistic Data Consortium (LDC). The LDC currently has the following TDT corpora (http://www.ldc.upenn.edu/TDT) available for system development: the TDT Pilot study (TDT-Pilot), the TDT Phase 2 (TDT2) corpus, the TDT Phase 3 (TDT3) corpus, the TDT3 Arabic supplement, the TDT Phase 4 (TDT4) corpus, and the TDT Phase 5 (TDT5) corpus.

    Contact the LDC at ldc@ldc.upenn.edu to obtain these materials. If you already have the TDT2 corpus, you can confirm that you have the latest TDT2 updates on the LDC's TDT2 Current Release webpage.

    Evaluation Infrastructure Resources

    NIST is providing evaluation infrastructure support for TDT. NIST has developed a suite of TDT evaluation scripts. The package is called TDT3eval and the latest version is available from the URL ftp://jaguar.ncsl.nist.gov/tdt/tdt2004/software/TDT3eval_v2.6.tgz.

    The suite contains an evaluation script for each of the 5 TDT evaluation tasks.

    The suite includes a script to generate evaluation index files (see the TDT98 and TDT99 evaluation plans for a description).

    A second package was built for evaluating hierarchical topic detection systems. HTDEval V1.4 is available from the URL ftp://jaguar.ncsl.nist.gov/tdt/tdt2004/software/HTD-1.4-20040914-0844.tgz.

    Mandarin Chinese Resources

    The LDC has prepared a web page containing pointers to LDC and WWW Mandarin language resources. The page will be updated as new resources become available.

    Arabic Resources

    There are currently two resources available from the LDC. The first is a supplemental corpus for the TDT3 data (LDC2002E32). It consists of a subsample of the "Arabic Newswire Text Part 1" (described below) matching the TDT3 epoch. The data has been formatted according to the previously established TDT conventions and will therefore work with existing TDT evaluation software.

    The LDC previously published the "Arabic Newswire Text Part 1" (LDC2001T55), which spans a 6-year archive from Agence France Presse (AFP). The collection includes articles from 13 May 1994 to 20 December 2000. Care must be taken not to use data from the TDT 2002 evaluation epoch, which begins October 1, 2000.
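
    The date cutoff described above can be applied mechanically once each article's publication date is known. The following is only a minimal sketch in Python: it assumes the dates have already been extracted from the corpus (how they are encoded in the LDC release is not covered here), and the document IDs and function names are hypothetical, chosen for illustration.

```python
from datetime import date

# Start of the TDT 2002 evaluation epoch, as stated in the text above.
EVAL_EPOCH_START = date(2000, 10, 1)


def usable_for_development(article_date: date) -> bool:
    """Return True if the article predates the evaluation epoch."""
    return article_date < EVAL_EPOCH_START


def filter_articles(articles):
    """Keep only (doc_id, date) pairs dated before the evaluation epoch."""
    return [(doc_id, d) for doc_id, d in articles if usable_for_development(d)]


# Hypothetical document IDs and dates, for illustration only.
sample = [
    ("doc-a", date(1994, 5, 13)),
    ("doc-b", date(2000, 9, 30)),
    ("doc-c", date(2000, 10, 1)),
    ("doc-d", date(2000, 12, 20)),
]

kept = filter_articles(sample)
for doc_id, d in kept:
    print(doc_id, d.isoformat())
```

    The comparison is strict, so an article dated October 1, 2000 itself is treated as part of the evaluation epoch and excluded.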

    As part of the 2002 TREC Arabic information retrieval evaluation, the University of Maryland has assembled a set of Arabic resources.

     

     

    Page Created: August 21, 2007
    Last Updated: November 4, 2008

    Multimodal Information Group is part of IAD and ITL
    NIST is an agency of the U.S. Department of Commerce