Version 4.5, 04-Mar-1997
Introduction
The 1997 Hub-5E evaluation is part of an ongoing series of periodic evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of conversational speech recognition. To this end the evaluation was designed to be simple, to focus on core speech technology issues, to be fully supported, and to be accessible.
The Hub-5E evaluation, conducted in the Spring, complements another related evaluation which is conducted in the Fall. The Fall evaluation focuses primarily on the recognition of multiple languages and on issues related to porting recognition technology to new languages, to system generality, and to language commonalities and universals. This evaluation is dedicated to the advancement of speech recognition technology for English, and specifically for General American English.
The 1997 Hub-5E evaluation will be conducted in March. (Data will go out on March 10th, and results are due back by March 31st.) A follow-up workshop for evaluation participants will be held in mid-May (May 13th-15th) to discuss research findings. Participation in the evaluation is solicited for all sites that find the task worthy and the evaluation of interest. For more information, and to register a desire to participate in the evaluation, please contact Dr. Alvin Martin at NIST.
The Hub-5E evaluation focuses on the task of transcribing
conversational speech into text. This task is posed in the context of
conversational telephone speech in General American English. The evaluation
is designed to foster research progress, with the goals of:
The task is to transcribe conversational speech. The speech to be transcribed is presented as a set of conversations collected over the telephone. Each conversation is represented as a "4-wire" recording, that is with two distinct sides, one from each end of the telephone circuit. Each side is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit mu-law encoding).
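As an illustration of the signal format, the following minimal sketch expands 8-bit mu-law bytes to 16-bit linear samples using the standard G.711 mu-law expansion. The function names are hypothetical, and the sketch assumes the raw sample bytes of one conversation side have already been extracted from the recording file.

```python
from typing import List

SAMPLE_RATE_HZ = 8000  # sampling rate stated in the task definition


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> List[int]:
    """Decode a buffer of mu-law bytes for one conversation side."""
    return [ulaw_to_linear(b) for b in data]


if __name__ == "__main__":
    samples = decode_ulaw(bytes([0xFF, 0x80, 0x00]))
    duration_s = len(samples) / SAMPLE_RATE_HZ
    print(samples, duration_s)
```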
Each conversation is represented as a sequence of "turns", where each turn is the period of time when one speaker is speaking. Each successive turn results from a reversal of speaking and listening roles for the conversation participants. The transcription task is to produce the correct transcription for each of the specified turns. The beginning and ending times 1 of each of these turns will be supplied as side information to the system under test. This information, stored in a single PEM file 2, will determine the test material.
The American Heritage Dictionary (AHD) 3 will serve as the standard reference for word spellings. Words that don't occur in the AHD will be spelled using the most common accepted spelling.
All of the SwitchBoard corpus (SwitchBoard-1), and all of the Call_Home English training corpus (the first 100 conversations), may be used for training. The 20 Call_Home English conversations that were designated as the Spring 1996 development set may also be used for training. These corpora are available from the LDC. Additional data may also be used for training, provided that the data are made publicly available at the time of reporting results.
The April 1996 Call_Home English EvalSet (20 conversations) will serve as the DevSet. Segment time marks and corresponding SNOR transcriptions for these data will be provided in standard STM 4 (segment time marked) format. The file names for these 20 conversations are
| 1. en_4792 | 6. en_6047 | 11. en_6313 | 16. en_6479 |
| 2. en_4801 | 7. en_6071 | 12. en_6408 | 17. en_6521 |
| 3. en_4829 | 8. en_6179 | 13. en_6447 | 18. en_6625 |
| 4. en_5872 | 9. en_6265 | 14. en_6456 | 19. en_6785 |
| 5. en_5888 | 10. en_6298 | 15. en_6467 | 20. en_6825 |
This subset is the first 30 seconds
of speech (from the 5 minute evaluation excerpt) for each conversation
side of all conversations. The actual elapsed time will be at least
30 seconds duration, in order to capture 30 seconds of speech. (In
some cases the elapsed time may reach the limit of 5 minutes if the
speaker mostly listens and doesn't say much.) Segment time marks and
corresponding transcriptions for subset 1 will be provided in standard
STM format.
This subset is simply 7 of the 20 DevSet conversations. These conversations are selected to be a representative sampling with respect to speaker sex and word error rate. Segment time marks and corresponding transcriptions for subset 2 will be provided in standard STM format. The file names of the selected conversations are
| 1. en_4792 | 3. en_6179 | 5. en_6521 | 7. en_6825 |
| 2. en_5872 | 4. en_6456 | 6. en_6625 |
The EvalSet will comprise 20 conversations each from Call_Home English and Switchboard-2, for a total of 40 conversations. Whole conversations will be supplied, but recognition will be scored only for a 5 minute excerpt chosen from each conversation. Speaker turn segmentation information for these 5 minute excerpts will be supplied to guide the recognition system. This segmentation information will be supplied in NIST's PEM file format.
While the Evaluation data will come from two different sources (namely Call_Home and Switchboard-2), the identity of the source is not to be provided to the system under test. The system must either recognize the speech irrespective of the source or must automatically determine the source from examination of the speech signal.
The Switchboard-2 conversations will be provided after the application of an echo cancellation algorithm. This algorithm is being used to maximize the separation of the two sides of each conversation. The echo cancellation software and documentation are available from Mississippi State's ISIP laboratory 5 via anonymous ftp.
Version 2.5 of the echo cancellation software will be used. A perl wrapper script to process SPHERE-headered speech files will also be provided.
Echo cancellation will not be performed on the Call_Home English data. The reason is that the Call_Home data were collected from a telephone system that already had (nonlinear) echo suppression incorporated in it.
Each system will be evaluated by measuring that
system's word error rate (WER). Each system will also be evaluated
in terms of its ability to predict recognition errors. System performance
will be evaluated over an ensemble of conversations. These conversations
will be chosen to represent a statistical sampling of conditions
of evaluation interest. For the 1997 evaluation these conditions
will include sex, geographical distribution and age.
The reference transcriptions are intended to be
as accurate as possible, but there will necessarily be some ambiguous
cases and outright errors. In view of the existing high error
rates of automatic recognizers on this type of data, it is not
considered cost effective to generate multiple independent human
transcriptions of the data or to have a formal adjudication procedure
following the evaluation submissions.
The reference transcription for each turn will be limited to a single sequence of words. This word sequence will represent the transcriber's best judgment of what the speaker said.
Word fragments will be represented by an initial part of a word with a hyphen at the end. Correct recognition will consist of either ignoring the fragment, or producing a word of which the fragment is an initial part.
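The fragment rule can be expressed as a small predicate. This is only a sketch of the rule as stated above: the function name is hypothetical, `None` stands for "the system output nothing for this fragment", and case-insensitive comparison is an assumption.

```python
from typing import Optional


def fragment_is_correct(ref_fragment: str, hyp_word: Optional[str]) -> bool:
    """Return True if the hypothesis handles a reference word fragment correctly.

    ref_fragment: the initial part of a word followed by a hyphen, e.g. "recog-".
    hyp_word: the hypothesized word, or None if the fragment was ignored.
    """
    if not ref_fragment.endswith("-"):
        raise ValueError("a reference fragment must end with a hyphen")
    if hyp_word is None:
        return True  # ignoring the fragment counts as correct
    stem = ref_fragment[:-1].lower()
    return hyp_word.lower().startswith(stem)


if __name__ == "__main__":
    print(fragment_is_correct("recog-", None))
    print(fragment_is_correct("recog-", "recognition"))
    print(fragment_is_correct("recog-", "record"))
```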
The reference transcription will contain no hyphenated
words. Each hyphenated word will be split into its
constituent words.
Word error rate is defined as the total number of words in error divided by the number of words in the reference transcription. The words in error are of three types, namely substitution errors, deletion errors, and insertion errors. Identification of these errors results from the process of mapping the words in the reference transcription onto the words in the system output transcription. This mapping is performed using NIST's SCLITE software package 6.
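Official scores come from SCLITE. The following is only a minimal sketch of the underlying computation, using a plain minimum-edit-distance alignment with unit costs. It does not reproduce SCLITE's alignment details or the scoring rules for fragments, spelling variants, and hesitations, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# (total errors, substitutions, deletions, insertions)
Counts = Tuple[int, int, int, int]


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> Counts:
    """Count errors along one minimum-cost alignment of hyp against ref."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    table: List[List[Counts]] = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)        # all reference words deleted
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)        # all hypothesis words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)       # correct word
            else:
                diag = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insert = (t + 1, s, d, n + 1)
            table[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    return table[-1][-1]


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return word_errors(ref, hyp)[0] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sit on mat please".split()
    print(word_error_rate(reference, hypothesis))
```

Because insertions are counted as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.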
Some variant spellings of the same word exist in the transcriptions. These words, with or without hyphens, will be mapped onto a single preferred word spelling without hyphens. The set of all such mappings is:
| Input | Output |
| mhm | uhhuh |
| mmhm | uhhuh |
| mm-hm | uhhuh |
| mm-huh | uhhuh |
| huh-uh | uhuh |
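The mapping table above can be applied as a simple lookup. In this sketch the dictionary entries are taken directly from the table, while the function name and the lowercasing before lookup are assumptions.

```python
# Variant spelling -> preferred spelling, as listed in the table above.
SPELLING_MAP = {
    "mhm": "uhhuh",
    "mmhm": "uhhuh",
    "mm-hm": "uhhuh",
    "mm-huh": "uhhuh",
    "huh-uh": "uhuh",
}


def normalize_spelling(word: str) -> str:
    """Map a variant spelling onto its preferred form; other words pass through."""
    return SPELLING_MAP.get(word.lower(), word)


if __name__ == "__main__":
    print([normalize_spelling(w) for w in "mm-hm yes huh-uh".split()])
```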
For scoring purposes, all hesitation sounds will be considered to be equivalent. Thus all reference transcription words beginning with "%", the hesitation sound flag, along with the conventional set of hesitation sounds, will be mapped to "%hesitation".
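A sketch of this mapping follows. The "%" flag and the "%hesitation" target come from the rule above. The function name is hypothetical, and the hesitation sounds passed in the usage example ("uh", "um") are placeholders for illustration only, not the evaluation's defined set.

```python
from typing import AbstractSet

HESITATION_TOKEN = "%hesitation"


def map_hesitation(word: str, hesitation_sounds: AbstractSet[str] = frozenset()) -> str:
    """Map flagged or conventional hesitation sounds onto one scoring token.

    hesitation_sounds: the conventional set of hesitation sounds (lowercase);
    it must be supplied by the caller from the evaluation's definition.
    """
    if word.startswith("%") or word.lower() in hesitation_sounds:
        return HESITATION_TOKEN
    return word


if __name__ == "__main__":
    placeholder_set = frozenset({"uh", "um"})  # illustrative placeholders only
    print([map_hesitation(w, placeholder_set) for w in "%ah well um yes".split()])
```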
The system output transcriptions should use any of the hesitation sounds (without "%") when a hesitation is hypothesized. The set of hesitation sounds for the current evaluation is defined to be:
Along with each word output by a system, a confidence measure is also required. This confidence measure is the system's estimate of the probability that the word is correct. While this might be merely a constant probability, independent of the input, certain applications and operating conditions may derive significant benefit from a more informative estimate that is sensitive to the input signal. This benefit will be evaluated by computing the mutual information (cross entropy) between the correctness of the system's output word and the confidence measure output for it, normalized by maximum cross entropy:
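As a rough illustration of such a measure, the sketch below computes one commonly used formulation of normalized cross entropy. It is an assumption, not a reproduction of this plan's formula, and should be checked against the official definition before use. Here the baseline entropy H_max is that of a system which outputs the overall fraction of correct words as a constant confidence, and the score is (H_max - H_sys) / H_max. The function name and the clamping constant are hypothetical.

```python
import math
from typing import Sequence


def normalized_cross_entropy(confidences: Sequence[float],
                             correct: Sequence[bool],
                             eps: float = 1e-12) -> float:
    """Normalized cross entropy of confidence scores (assumed formulation).

    confidences[i]: system's probability that output word i is correct.
    correct[i]: whether output word i was scored as correct.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need one confidence per scored word")
    n_total = len(correct)
    n_correct = sum(1 for ok in correct if ok)
    p_c = n_correct / n_total
    if p_c in (0.0, 1.0):
        raise ValueError("baseline entropy is zero; measure is undefined")
    # Entropy of the constant-confidence baseline, in bits.
    h_max = -(n_correct * math.log2(p_c)
              + (n_total - n_correct) * math.log2(1.0 - p_c))
    # Cross entropy of the system's confidences, clamped to avoid log(0).
    h_sys = 0.0
    for p, ok in zip(confidences, correct):
        p = min(max(p, eps), 1.0 - eps)
        h_sys -= math.log2(p if ok else 1.0 - p)
    return (h_max - h_sys) / h_max


if __name__ == "__main__":
    print(normalized_cross_entropy([0.9, 0.8, 0.3, 0.6], [True, True, False, True]))
```

Under this formulation a constant confidence equal to the fraction of correct words scores 0, perfectly informative confidences approach 1, and misleading confidences can score below 0.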
The time-marked hypothesis words for each test will be placed in a single file, called "<TEST_SET>.ctm". The CTM (Conversation Time-Mark) file format is a concatenation of time marks for each word in each side of a conversation. Each word token must have a conversation id, channel identifier [A | B], start time, duration, case-insensitive word text, and a confidence score. The start time must be in seconds and relative to the beginning of the waveform file. The conversation ids for this evaluation will be of the form:
CONV_ID::= <SWB2_ID> | <CALLHOME_ID>
The file must be sorted by the first three columns: the first and the second in ASCII order, and the third in numeric order. The UNIX sort command: "sort +0 -1 +1 -2 +2nb -3" will sort the words into the appropriate order.
Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
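The format rules above can be sketched as a small reader and sorter. This assumes six whitespace-separated fields per line in the order listed (conversation id, channel, start time, duration, word, confidence). The function names and the sample records, including the conversation ids, are illustrative only.

```python
from typing import List, Tuple

# (conversation id, channel, start time, duration, word, confidence)
CtmRecord = Tuple[str, str, float, float, str, float]


def parse_ctm(text: str) -> List[CtmRecord]:
    """Parse CTM text, skipping blank lines and ';;' comment lines."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(";;"):
            continue
        conv_id, channel, start, duration, word, confidence = stripped.split()
        if channel not in ("A", "B"):
            raise ValueError(f"bad channel identifier: {channel!r}")
        records.append((conv_id, channel, float(start), float(duration),
                        word, float(confidence)))
    return records


def sort_ctm(records: List[CtmRecord]) -> List[CtmRecord]:
    """Sort by conversation id and channel (string order), then start time."""
    return sorted(records, key=lambda r: (r[0], r[1], r[2]))


if __name__ == "__main__":
    sample = """\
;; illustrative records only
en_4792 B 1.50 0.30 hello 0.91
en_4792 A 2.10 0.25 yeah 0.80

en_4792 A 0.40 0.20 uhhuh 0.65
"""
    for record in sort_ctm(parse_ctm(sample)):
        print(*record)
```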
Included below is an example:
Create a directory identifying your site ('SITE') from the following list which will serve as the root directory for all your submissions:
You should place all of your recognition test results in this directory. When scored results are sent back to you and subsequently published, this directory name will be used to identify your organization.
For each test system, create a sub-directory under your 'SITE' directory identifying the system's name or key attribute. The sub-directory name is to consist of a free-form system identification string 'SYSID' chosen by you. Place all files pertaining to the tests run using a particular system in the same SYSID directory.
The following is the BNF directory structure format for Hub-5E hypothesis recognition results:
<SITE>/<SYSID>/<FILES>
where,
Step 3: System Documentation, including execution times, and System Output Inclusion
For each test you run, a brief description of the
system (the algorithms) used to produce the results must be submitted
along with the results, for each system evaluated. (It is permissible
for a single site to submit multiple systems for evaluation. In this
case, however, the submitting site must identify one system as the
"primary" system prior to performing the evaluation.)
Your system description file should be placed in the 'SYSID' sub-directory which it pertains to and must be called, "sys-desc.txt".
Likewise, the time-marked hypothesis file, created in step 1, should be placed in the 'SYSID' sub-directory which it pertains to, and must be called, "<TEST_SET>.ctm". For this evaluation, the value for <TEST_SET> will be "english".
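The resulting layout can be sketched as follows. The file names "sys-desc.txt" and "english.ctm" come from the text above. The function name and the SITE and SYSID values in the usage example are hypothetical placeholders, and the example writes into a temporary directory rather than a real submission area.

```python
import tempfile
from pathlib import Path


def build_submission_tree(root: str, site: str, sysid: str,
                          ctm_text: str, description: str,
                          test_set: str = "english") -> Path:
    """Create <SITE>/<SYSID>/ under root and write the two required files."""
    sys_dir = Path(root) / site / sysid
    sys_dir.mkdir(parents=True, exist_ok=True)
    (sys_dir / "sys-desc.txt").write_text(description)
    (sys_dir / f"{test_set}.ctm").write_text(ctm_text)
    return sys_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # "example_site" and "baseline1" are placeholder names.
        out = build_submission_tree(root, "example_site", "baseline1",
                                    ctm_text=";; hypothesis words go here\n",
                                    description="Brief system description.\n")
        print(sorted(p.name for p in out.iterdir()))
```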
Step 4: Test Results Submission Protocol
Once you have structured all of your recognition results according to the above format, you can then submit them to NIST. Because of file size restrictions on international e-mail, test sites are permitted to submit results to NIST using either e-mail or anonymous ftp. Continental US sites may use either method, but international sites must use the 'ftp' method. The following instructions assume that you are using the UNIX operating system. If you do not have access to UNIX utilities or ftp, please contact NIST to make alternate arrangements.
E-mail method:
where,
Ftp method:
where,
This command creates a single file containing all of your results. Next, ftp to jaguar.ncsl.nist.gov giving the username 'anonymous' and your e-mail address as the password. After you are logged in, issue the following set of commands, (the prompt will be 'ftp>'):
You've now submitted your recognition results to NIST. The last thing you need to do is send an e-mail message to Alvin Martin at 'alvin.martin@nist.gov' notifying NIST of your submission. Please include the name of your submission file in the message.
Note:
| DevSet Release | Autumn 1996 |
| Commitment Deadline | March 3, 1997 |
| EvalSet Release | March 10, 1997 |
| Results Deadline | March 31, 1997 at 7:00 p.m. EST |
| Workshop | May 13-15, 1997, Maritime Institute of Technology and Graduate Studies, Linthicum, Maryland |