If you still have questions, please refer to this document to find complementary information.
A: There are six file formats that you should be aware of:
The recorded audio collection to be recognized is formatted using the
NIST SPHERE waveform file format. A UNIX-based toolkit to
read and manipulate SPHERE files is available via the NIST Spoken Language Technology Evaluation and Utility Software page.
.sph - SPHERE waveform: SPHERE-formatted digitized recording of a broadcast, used as input to speech recognition systems. The waveform format is 16-bit linear PCM at a 16 kHz sample rate, MSB/LSB byte order.
NIST_1A
1024
sample_count -i 27444801
sample_rate -i 16000
channel_count -i 1
sample_byte_format -s2 10
sample_n_bytes -i 2
sample_coding -s3 pcm
sample_min -i -27065
sample_max -i 27159
sample_checksum -i 31575
database_id -s7 Hub4_96
broadcast_id NPR_MKP_960913_1830_1900
sample_sig_bits -i 16
end_head
(digitized 16-bit waveform follows header)
.
.
.
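As an illustration of the header layout shown above, the following is a minimal sketch of reading the text fields of a SPHERE header in Python. It assumes each field line has the form "name type value", where the type is -i (integer), -r (real), or -sN (string of N characters); the function name parse_sphere_header and the abbreviated SAMPLE_HEADER are made up for this example and are not part of the NIST toolkit.

```python
def parse_sphere_header(header_text):
    """Parse the ASCII portion of a SPHERE header into a dict (sketch)."""
    lines = [ln.strip() for ln in header_text.splitlines() if ln.strip()]
    if len(lines) < 2 or lines[0] != "NIST_1A":
        raise ValueError("not a NIST_1A SPHERE header")
    # The second line gives the header size in bytes.
    fields = {"header_bytes": int(lines[1])}
    for line in lines[2:]:
        if line == "end_head":
            break
        # Assumed field layout: name, type code, value.
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:  # -sN: keep the string value as-is
            fields[name] = value
    return fields


# Abbreviated copy of the example header (hypothetical test input).
SAMPLE_HEADER = """NIST_1A
   1024
sample_count -i 27444801
sample_rate -i 16000
channel_count -i 1
sample_byte_format -s2 10
sample_n_bytes -i 2
sample_coding -s3 pcm
end_head
"""

header = parse_sphere_header(SAMPLE_HEADER)
# Recording length in seconds follows from sample_count / sample_rate.
duration_seconds = header["sample_count"] / header["sample_rate"]
print(f"{duration_seconds:.1f} seconds of audio")
```

To read a real file, the first header["header_bytes"] bytes would be read in binary mode and decoded as ASCII before being passed to this function.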
.ltt - Lexical TREC Transcription: ASR-style reference transcription with all SGML tags removed except for Episode and Section. "Non-News" Sections are excluded. This format is used as the source for the Reference Retrieval condition.
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
</Section>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
agricultural products giant archer daniels midland is often described as politically well connected any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed
...
</Section>
...
</Episode>
.ndx - Index: Specifies <Sections> in the waveform and establishes story boundaries and IDs. Similar to the LTT format without text. Non-transcribed Sections are excluded.
For the known story boundaries condition, the ndx format will require one
Section tag per story as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
...
</Episode>
For the unknown story boundaries condition, the ndx format will require a
single "FAKE" Section tag that will encompass the entire Episode as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News"
Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
</Episode>
Note that the start time of the fake section is the start time of the first NEWS story and
the end time is the end time of the last NEWS story of the show.
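The relationship between the two ndx variants can be sketched in Python. The example below assumes Section tags are written on one line with the attributes in the order shown above (Type, S_time, E_time, ID); the names parse_sections and fake_section_span are hypothetical helpers, not part of any NIST software.

```python
import re

# Matches one Section tag with unquoted attributes, as in the examples above.
SECTION_RE = re.compile(
    r"<Section\s+Type=(\S+)\s+S_time=([\d.]+)\s+E_time=([\d.]+)\s+ID=([^\s>]+)\s*>"
)


def parse_sections(ndx_text):
    """Return a list of dicts, one per Section tag found in ndx_text."""
    return [
        {
            "type": m.group(1),
            "start": float(m.group(2)),
            "end": float(m.group(3)),
            "id": m.group(4),
        }
        for m in SECTION_RE.finditer(ndx_text)
    ]


def fake_section_span(sections):
    """Start of the first NEWS story and end of the last NEWS story."""
    news = [s for s in sections if s["type"] == "NEWS"]
    return min(s["start"] for s in news), max(s["end"] for s in news)


NDX_KNOWN = """<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
</Episode>
"""

sections = parse_sections(NDX_KNOWN)
span = fake_section_span(sections)
print(len(sections), span)
```

Here the span is computed only from the two Sections in the abbreviated sample, so it does not reproduce the 58.67 to 1829.47 range of the full episode.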
.srt - Speech Recogniser Transcript (contrived example): Output of speech recogniser for a .sph recorded waveform file which will be used as input for retrieval. Each file must contain an <Episode> tag and properly interleaved <Section> tags taken from the corresponding .ndx file. Each <Word> tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word.
For the known story boundaries condition, the Section tags follow the ones specified in the ndx file.
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Word S_time=75.52 E_time=75.87>his</Word>
<Word S_time=75.87 E_time=76.36>friday'S</Word>
<Word S_time=76.36 E_time=76.82>september</Word>
<Word S_time=76.82 E_time=77.47>thirteenth</Word>
...
</Section>
...
</Episode>
For the unknown story boundaries condition, the srt format will require a single "FAKE" Section tag that will encompass the entire Episode as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
...
<Word S_time=75.52 E_time=75.87>his</Word>
<Word S_time=75.87 E_time=76.36>friday'S</Word>
<Word S_time=76.36 E_time=76.82>september</Word>
<Word S_time=76.82 E_time=77.47>thirteenth</Word>
...
</Section>
</Episode>
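A small Python sketch of writing and reading Word tags may help when producing srt files. It assumes one Word tag per line and formats times with two decimal places as described above; format_word, parse_words, and inverted_words are hypothetical names introduced for this example.

```python
import re

WORD_RE = re.compile(r"<Word\s+S_time=([\d.]+)\s+E_time=([\d.]+)\s*>(.*?)</Word>")


def format_word(start, end, word):
    """Render one recognized word as an srt Word tag (times to 2 decimals)."""
    return f"<Word S_time={start:.2f} E_time={end:.2f}>{word}</Word>"


def parse_words(srt_text):
    """Return (start, end, word) tuples for every Word tag in srt_text."""
    return [(float(s), float(e), w) for s, e, w in WORD_RE.findall(srt_text)]


def inverted_words(words):
    """Return the words whose end time precedes their start time."""
    return [w for w in words if w[1] < w[0]]


# Hypothetical recognizer output: (start, end, word).
recognized = [(75.52, 75.87, "his"), (76.36, 76.82, "september")]
srt_lines = "\n".join(format_word(*w) for w in recognized)
print(srt_lines)

round_trip = parse_words(srt_lines)
```

Running a check such as inverted_words over parsed output is one cheap way to catch timing mistakes before submitting a file.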
.ctm - This describes the time marked conversation input files to be used for scoring the output of speech recognizers via the NIST sclite program. Both the reference and hypothesis input files can share this format.
The ctm file format is a concatenation of time mark records for each word in each channel of a waveform. The records are separated with a newline. Each word token must have a waveform id, channel identifier which is always 1 for SDR, start time, duration, and word text.
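CTM :== <F> <C> <BT> <DUR> word [ <CONF> ] where:
<F> is the waveform id, <C> is the channel identifier (always 1 for SDR), <BT> is the begin time of the word in seconds, <DUR> is the duration of the word in seconds, and <CONF> is an optional confidence score.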
Example CTM file :
;;
;; Comments follow ';;'
;;
ea980107 1 3.620 0.180 ON
ea980107 1 3.800 0.240 WORLD
ea980107 1 4.040 0.180 NEWS
ea980107 1 4.220 0.270 TONIGHT
ea980107 1 4.490 0.190 THIS
ea980107 1 4.680 0.390 WEDNESDAY
For more information regarding the ctm format, see the SCLITE documentation.
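The record layout above can be read with a few lines of Python. This is only a sketch: it assumes whitespace-separated fields, treats a sixth field as the optional confidence value, and skips blank lines and lines beginning with ';;'. The CtmRecord type and parse_ctm function are names invented for this example.

```python
from typing import NamedTuple, Optional


class CtmRecord(NamedTuple):
    file: str            # waveform id
    channel: str         # channel identifier
    begin: float         # begin time in seconds
    duration: float      # duration in seconds
    word: str
    conf: Optional[float] = None


def parse_ctm(text):
    """Parse ctm text into a list of CtmRecord tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) not in (5, 6):
            raise ValueError(f"unexpected ctm record: {line!r}")
        conf = float(parts[5]) if len(parts) == 6 else None
        records.append(
            CtmRecord(parts[0], parts[1], float(parts[2]), float(parts[3]), parts[4], conf)
        )
    return records


CTM_SAMPLE = """;;
;; Comments follow ';;'
;;
ea980107 1 3.620 0.180 ON
ea980107 1 3.800 0.240 WORLD
ea980107 1 4.040 0.180 NEWS
"""

ctm_records = parse_ctm(CTM_SAMPLE)
for rec in ctm_records:
    # End time is not stored in ctm; it is begin + duration.
    print(rec.word, round(rec.begin + rec.duration, 3))
```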
.stm - The stm (segment time marked) format describes the reference transcript file format to be used for scoring the output of speech recognizers via the NIST sclite program. An stm file consists of a concatenation of text segment records transcribed from a recorded waveform file. Each record is separated by a newline and contains: the waveform's filename and channel identifier [A | B], the talker's id, begin and end times (in seconds), an optional subset label, and the text for the segment. Each record follows this BNF format:
STM :== <F> <C> <S> <BT> <ET> [ <LABEL> ] transcript ... where:
<F> is the waveform's filename, <C> is the channel identifier, <S> is the talker's id, <BT> and <ET> are the begin and end times in seconds, and <LABEL> is the optional subset label.
The list of words can contain any transcript alternation using the following
BNF format:
ALTERNATE :== "{" <text> ALT+ "}"
ALT :== "/" <text>
TEXT :== 1 thru n words | "@" | ALTERNATE
The "@" represents a NULL word in the transcript. For scoring purposes,
an error is not counted if the "@" is aligned as an insertion.
Example: "i've { um / uh / @ } as far as i'm concerned"
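To make the alternation syntax concrete, here is a rough Python sketch that lists every word sequence a transcript with alternations can stand for. It handles only non-nested alternations (the BNF above also permits nesting) and drops "@" because it denotes a NULL word; expand_alternations is a hypothetical helper, and sclite itself does not necessarily work this way.

```python
import itertools
import re

# One non-nested alternation: "{ ... }" with no braces inside.
ALT_RE = re.compile(r"\{([^{}]*)\}")


def _words(text):
    """Split text into words, dropping the NULL word '@'."""
    return [w for w in text.split() if w != "@"]


def expand_alternations(transcript):
    """Return all plain word strings represented by the transcript."""
    # re.split with one capture group alternates: plain text, alternation
    # body, plain text, alternation body, ...
    parts = ALT_RE.split(transcript)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([_words(part)])            # fixed text: one choice
        else:
            choices.append([_words(alt) for alt in part.split("/")])
    results = []
    for combo in itertools.product(*choices):
        results.append(" ".join(w for chunk in combo for w in chunk))
    return results


variants = expand_alternations("i've { um / uh / @ } as far as i'm concerned")
for v in variants:
    print(v)
```

The "@" alternative yields the variant with no filler word at all, which is why an unmatched filler is not counted as an error.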
Example STM file:
;; comment
2345 A 2345-a 0.10 2.03 uh huh yes i thought
2345 A 2345-b 2.10 3.04 dog walking is a very
2345 A 2345-a 3.50 4.59 yes but it's worth it
The file must be sorted by the first and second columns in ASCII order, and the fourth in numeric order. The UNIX sort command "sort +0 -1 +1 -2 +3nb -4" will sort the records into the appropriate order.
Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
For more information regarding the stm format, see the SCLITE documentation.
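The stm record layout and sort order can be sketched in Python as follows. The sketch assumes that an optional subset label, when present, is a single token enclosed in angle brackets directly after the end time; that detail is an assumption to verify against the SCLITE documentation. StmRecord, parse_stm, and sort_stm are names made up for this example.

```python
from typing import NamedTuple, Optional


class StmRecord(NamedTuple):
    file: str
    channel: str
    speaker: str
    begin: float
    end: float
    label: Optional[str]
    transcript: str


def parse_stm(text):
    """Parse stm text, skipping blank lines and ';;' comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        parts = line.split(None, 5)
        if len(parts) < 5:
            raise ValueError(f"unexpected stm record: {line!r}")
        rest = parts[5] if len(parts) > 5 else ""
        label = None
        # Assumption: a label looks like "<...>" and precedes the transcript.
        if rest.startswith("<") and ">" in rest:
            label, rest = rest.split(">", 1)
            label += ">"
            rest = rest.strip()
        records.append(
            StmRecord(parts[0], parts[1], parts[2], float(parts[3]), float(parts[4]), label, rest)
        )
    return records


def sort_stm(records):
    """Order by filename and channel (string order), then begin time (numeric)."""
    return sorted(records, key=lambda r: (r.file, r.channel, r.begin))


STM_SAMPLE = """;; comment
2345 A 2345-a 0.10 2.03 uh huh yes i thought
2345 A 2345-b 2.10 3.04 dog walking is a very
2345 A 2345-a 3.50 4.59 yes but it's worth it
"""

stm_records = parse_stm(STM_SAMPLE)
resorted = sort_stm(list(reversed(stm_records)))
print([r.begin for r in resorted])
```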
The corpus for this evaluation is the February through June subset of the TDT-2 corpus as specified in Section 6 of the Evaluation Specification. The TDT-2 speech waveform files are available on CD-ROM from the Linguistic Data Consortium (LDC) (LDC order number: LDC99S82).
The specifications for recognition training and implementation for the SDR track are given in Section 9 of the Evaluation Specification. Please read this document carefully!
A: The files to be recognized are provided for each of the two permitted recognition modes in the ndx format:
This file contains the list of speech waveform files and excerpts to be recognized for the Story Boundaries Known recognition mode. Only the portions specified in this file are to be recognized. Note that this recognition mode is generally discouraged since it cannot be transformed for use in Story Boundaries Unknown retrieval modes. The recognized transcripts created for the Story Boundaries Unknown conditions may be converted to this form automatically using the following procedure.
This file contains the list of entire episodes to be recognized for the Story Boundaries Unknown recognition mode. Note that these episodes may be a subset of the material in the waveform file they are contained within. So, only the material specified in the index files should be recognized. The recognized transcripts created for this condition may be converted to the form required in the Story Boundaries Known condition via the following procedure.
To generate srt files from ctm files, you will need the corresponding ndx file for each show to limit scoring to excerpts for which there is a reference transcript.
Here are the steps to do this:
Scoring should be performed only on filtered files. Apply filtering as described above to all the ctm files to be scored.
Create one large concatenated ctm file for scoring.
Example: cat *.ctm | sort +0 -1 +1 -2 +2nb -3 > ../Result.ctm
The sort command makes sure the resulting ctm is sorted
according to ctm requirements.
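For readers without a sort command that accepts this key syntax, the same concatenate-and-sort step can be approximated in Python. This sketch orders records by waveform id and channel as strings and by begin time as a number; unlike the cat and sort pipeline, it also drops comment and blank lines. The function merge_ctm and the sample data are hypothetical.

```python
def merge_ctm(ctm_texts):
    """Concatenate several ctm texts and return the records in sorted order."""
    rows = []
    for text in ctm_texts:
        for line in text.splitlines():
            stripped = line.strip()
            if not stripped or stripped.startswith(";;"):
                continue  # comments and blank lines are not records
            rows.append(stripped)

    def sort_key(row):
        fields = row.split()
        # Columns 1 and 2 as strings, column 3 (begin time) numerically.
        return (fields[0], fields[1], float(fields[2]))

    return sorted(rows, key=sort_key)


# Two hypothetical per-show ctm files, given here as strings.
show_b = "showB 1 10.500 0.200 LATER\nshowB 1 2.000 0.300 EARLY\n"
show_a = ";; comment\nshowA 1 3.620 0.180 ON\n"

merged = merge_ctm([show_b, show_a])
print("\n".join(merged))
```

Sorting the begin time as a number matters: as plain text, "10.500" would sort before "2.000".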
Score the recognizer-produced ctm file against the reference
transcript. The scoring process may be launched using the perl script sdr-score.pl.
Given a reference transcript file in stm format and a recognizer-produced hypothesis transcript file in
ctm format, it will perform all the necessary checks including software
version verification, some orthographic normalization, and the scoring.
Example: sdr-score.pl -r sdr98-reference.stm ./result.ctm
A:
Any general questions regarding TREC should be addressed to the TREC project
Manager:
Ellen Voorhees, ellen.voorhees@nist.gov
You can also refer to http://trec.nist.gov.
Any questions regarding the SDR Track should be addressed to the Track
Organizer:
John Garofolo, john.garofolo@nist.gov (Speech)
Email discussion regarding the SDR Track can be addressed
to the Track Participant List:
sdrlist@jaguar.ncsl.nist.gov
Questions regarding licensing and obtaining training and test data
should be addressed to the Linguistic Data Consortium:
Shannon Sears
Linguistic Data Consortium
3615 Market Street
Suite 200
Philadelphia, PA 19104-2608
Phone: (215) 898-0464
Fax: (215) 573-2175
Email: ldc@unagi.cis.upenn.edu
WWW: http://www.ldc.upenn.edu