The following textual resources are available from the LDC for SDR recognition/retrieval training:
| Source | Dates Covered | Approx. # Words (Millions) |
|---|---|---|
| Los Angeles Times & Washington Post | 05/94-08/97 | 52 |
| New York Times News Syndicate | 07/94-12/96 | 173 |
| Reuters News Service (General & Financial) | 04/94-12/96 | 85 |
| Wall Street Journal | 07/94-12/96 | 40 |

| Source | Dates Covered | Approx. # Words (Millions) |
|---|---|---|
| Los Angeles Times & Washington Post | 09/97-04/98 | 11 |
| New York Times News Syndicate | 01/97-04/98 | 116 |
| Associated Press World Stream English | 11/94-04/98 | 143 |

| Source | Dates Covered | Approx. # Words (Millions) |
|---|---|---|
| New York Times News Syndicate | 01/98 - 06/98 | 18 |
| Associated Press World Stream English | 01/98 - 06/98 | 32 |

| Source | Dates Covered | Approx. # Words (Millions) |
|---|---|---|
| Los Angeles Times & Washington Post | 05/98 - 06/98 | --- |

Some of this material is contemporaneous with the 1999 SDR test material. Sites wishing to create rolling language model recognition systems for the recognition component of the SDR Track may use it. Contemporaneous data may not be used to create traditional fixed language model recognition systems for the SDR task.
Note that some of these materials overlap.
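
One way to see where the materials overlap is to compare the date ranges listed for each source. The following Python sketch does this at month granularity, assuming that "overlap" refers to the same source appearing in two tables with intersecting date ranges; the table labels `A` to `D` and all function names are hypothetical and are not part of the LDC releases.

```python
from datetime import date
from itertools import combinations


def parse_month(mm_yy):
    """Turn 'MM/YY' (two-digit 19xx year) into the first day of that month."""
    month, year = mm_yy.strip().split("/")
    return date(1900 + int(year), int(month), 1)


def parse_range(span):
    """Turn 'MM/YY-MM/YY' into a (start, end) pair of dates."""
    start, end = span.split("-")
    return parse_month(start), parse_month(end)


def ranges_overlap(a, b):
    """True if two inclusive (start, end) month ranges share a month."""
    return a[0] <= b[1] and b[0] <= a[1]


# (table label, source, dates covered), copied from the four tables above.
# The labels "A".."D" are hypothetical; the tables carry no names here.
RELEASES = [
    ("A", "Los Angeles Times & Washington Post", "05/94-08/97"),
    ("A", "New York Times News Syndicate", "07/94-12/96"),
    ("A", "Reuters News Service (General & Financial)", "04/94-12/96"),
    ("A", "Wall Street Journal", "07/94-12/96"),
    ("B", "Los Angeles Times & Washington Post", "09/97-04/98"),
    ("B", "New York Times News Syndicate", "01/97-04/98"),
    ("B", "Associated Press World Stream English", "11/94-04/98"),
    ("C", "New York Times News Syndicate", "01/98 - 06/98"),
    ("C", "Associated Press World Stream English", "01/98 - 06/98"),
    ("D", "Los Angeles Times & Washington Post", "05/98 - 06/98"),
]


def overlapping_pairs(releases):
    """List (source, range_1, range_2) for same-source ranges that overlap."""
    found = []
    for (_, src_a, span_a), (_, src_b, span_b) in combinations(releases, 2):
        if src_a == src_b and ranges_overlap(parse_range(span_a), parse_range(span_b)):
            found.append((src_a, span_a, span_b))
    return found


if __name__ == "__main__":
    for source, first, second in overlapping_pairs(RELEASES):
        print(f"{source}: {first} overlaps {second}")
```

Under that assumption, the sketch reports two overlaps: the New York Times News Syndicate ranges 01/97-04/98 and 01/98 - 06/98, and the Associated Press World Stream English ranges 11/94-04/98 and 01/98 - 06/98.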
A compendium containing the 1998 Hub-4 English Language training, development test, and evaluation test transcripts in the Universal Transcription Format (UTF) is available from the LDC via LDC Order Number LDC98E10. A subset of the transcripts in this release has been SGML-tagged with named, numeric, and time entities; these served as the training and development test reference transcripts for the 1998 Information Extraction - Named Entity Spoke. Contact the LDC to obtain this release.
The Hub-4 Broadcast News corpora are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information.
The January 1998 TDT-2 data (.ndx files), which was excluded from the test collection, can be used as a training collection by sites that need it.
In addition, you can download the corresponding .ltt files for the January data.
The following resources, which were used in the previous tests, may still be used for the current tests: