NIST Speech Group Website
Information Technology Lab, Information Access Division NIST: National Institute of Standards and Technology


    Training Resources

    Textual resources

    The following textual resources are available from the Linguistic Data Consortium (LDC) for training the recognition and retrieval components of Spoken Document Retrieval (SDR) systems:

    • LDC95T21 North American News Text Corpus:

      Source                                      Dates Covered  Approx. # Words (Millions)
      Los Angeles Times & Washington Post         05/94-08/97    52
      New York Times News Syndicate               07/94-12/96    173
      Reuters News Service (General & Financial)  04/94-12/96    85
      Wall Street Journal                         07/94-12/96    40

    • LDC98T30 North American News Text Supplement:

      Source                                      Dates Covered  Approx. # Words (Millions)
      Los Angeles Times & Washington Post         09/97-04/98    11
      New York Times News Syndicate               01/97-04/98    116
      Associated Press World Stream English       11/94-04/98    143

    • LDC99E12 SDR-99 Text Compendium:

      Source                                      Dates Covered  Approx. # Words (Millions)
      New York Times News Syndicate               01/98-06/98    18
      Associated Press World Stream English       01/98-06/98    32

    • LDC99E13 North American News Text Supplement 2:

      Source                                      Dates Covered  Approx. # Words (Millions)
      Los Angeles Times & Washington Post         05/98-06/98    ---

    Some of this material is contemporaneous with the 1999 SDR test material and may be used by sites wishing to create rolling language model recognition systems for the recognition component of the SDR Track. Contemporaneous data may not be used in creating traditional fixed language model recognition systems for the SDR task.
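
    The distinction between the two system types can be sketched as a date filter over the training text. This is only an illustration, not the track's official rule: the function name, its parameters, and the dates in the usage lines are hypothetical, and the sketch assumes that a rolling language model may draw on text published before the day of the story being recognized, which this page does not spell out.

```python
from datetime import date
from typing import Optional


def may_train_on(text_date: date, test_start: date,
                 story_date: Optional[date] = None) -> bool:
    """Return True if text dated `text_date` may be used for LM training.

    Fixed LM (story_date is None): only text that predates the test
    period is allowed, so no contemporaneous data is used.
    Rolling LM (story_date given): ASSUMPTION, text published before
    the day of the story being recognized is allowed as well.
    """
    cutoff = test_start if story_date is None else story_date
    return text_date < cutoff


# Placeholder dates for illustration only; they are not the real test dates.
test_start = date(1998, 2, 1)
newswire_day = date(1998, 3, 15)

print(may_train_on(newswire_day, test_start))                    # fixed LM
print(may_train_on(newswire_day, test_start, date(1998, 4, 1)))  # rolling LM
```

    Under these assumptions the same newswire article is rejected for a fixed language model but accepted for a rolling one, because only the cutoff date differs.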

    Note that some of these materials overlap.


    Speech resources

    A compendium containing the 1998 Hub-4 English Language training, development test, and evaluation test transcripts in the Universal Transcription Format (UTF) is available from the LDC under LDC Order Number LDC98E10. A subset of the transcripts in this release has been SGML-tagged with named, numeric, and time entities; this subset served as the training and development test reference transcripts for the 1998 Information Extraction - Named Entity Spoke. Contact the LDC to obtain this release.

    The Hub-4 Broadcast News corpora are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information.


    Resources from Previous Tests (Topics/Assessments)

    The following resources from previous tests may still be used for current tests:

    Page Created: August 23, 2007
    Last Updated: November 4, 2008

    Multimodal Information Group is part of IAD and ITL
    NIST is an agency of the U.S. Department of Commerce