The 1998 Hub-5 Evaluation Plan for Recognition of Conversational Speech over the Telephone, in English

Version 3.0, 28-Jul-98

Introduction

The 1998 Hub-5 evaluation is part of an ongoing series of periodic evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of conversational speech recognition. To this end the evaluation was designed to be simple, to focus on core speech technology issues, to be fully supported, and to be accessible.

This year's Hub-5 evaluation will be solely on conversational telephone data in English, and is dedicated to the advancement of speech recognition technology for English, and specifically for General American English. Word error rate will be the primary evaluation metric, as in the past. This year, however, a weighted word error rate will also be computed, and there will be a named entity task included in the evaluation, as described below.

The 1998 Hub-5 evaluation will be conducted in August and September. (Data will go out on August 14, and results are due back by September 4.) A follow-up workshop for evaluation participants will be held in late September (September 24th - 25th) to discuss research findings. Participation in the evaluation is solicited for all sites that find the task worthy and the evaluation of interest. For more information, and to register a desire to participate in the evaluation, please contact Dr. Alvin Martin at NIST, alvin.martin@nist.gov. Please note that the commitment deadline for sites participating in this evaluation is August 7, 1998.


Technical Objective

The Hub-5 evaluation focuses on the task of transcribing conversational speech into text. This task is posed in the context of conversational telephone speech in General American English. The evaluation is designed to foster research progress, with the goals of

  1. exploring promising new ideas in the recognition of conversational speech,
  2. developing advanced technology incorporating these ideas, and
  3. measuring the performance of this technology.


The Task

The task is to transcribe conversational speech. The speech to be transcribed is presented as a set of conversations collected over the telephone. Each conversation is represented as a "4-wire" recording, that is, with two distinct sides, one from each end of the telephone circuit. Each side is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit mu-law encoding).
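As an illustration of the signal format, the following Python sketch expands 8-bit mu-law codes into signed linear samples using the standard G.711 expansion. The function names are hypothetical and are not part of any NIST tool; reading the SPHERE header and separating the two sides are not shown.

```python
def mulaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a signed linear sample."""
    u = ~code & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80                  # after inversion, a set sign bit means negative
    exponent = (u >> 4) & 0x07       # 3-bit segment number
    mantissa = u & 0x0F              # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_side(raw: bytes) -> list:
    """Decode the raw mu-law bytes of one conversation side."""
    return [mulaw_to_linear(b) for b in raw]
```

At 8 kHz sampling, each decoded sample represents 1/8000 of a second, so a turn boundary given in seconds maps to a sample index by multiplying by 8000.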

Each conversation is represented as a sequence of "turns", where each turn is the period of time when one speaker is speaking. Each successive turn results from a reversal of speaking and listening roles for the conversation participants. The transcription task is to produce the correct transcription for each of the specified turns. The beginning and ending times [1] of each of these turns will be supplied as side information to the system under test. This information, stored in a single PEM file [2], will determine the test material.
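A minimal sketch of reading the turn information, assuming each PEM record is one whitespace-separated line with the five fields named in the footnotes (filename, channel, speaker, begin time, end time), and assuming that lines starting with ';;' are comments. The class and function names are hypothetical; the SCLITE documentation remains the authority on the format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    filename: str
    channel: str   # "A" or "B"
    speaker: str   # "unknown" in the evaluation PEM file
    begin: float   # seconds from the start of the waveform file
    end: float


def parse_pem(text: str) -> list:
    """Parse PEM records, skipping blank lines and ';;' comments (assumed)."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        filename, channel, speaker, begin, end = fields
        if channel not in ("A", "B"):
            raise ValueError(f"bad channel: {channel!r}")
        turn = Turn(filename, channel, speaker, float(begin), float(end))
        if turn.end <= turn.begin:
            raise ValueError(f"turn ends before it begins: {line!r}")
        turns.append(turn)
    return turns
```

Because the time marks completely encompass each turn, consecutive turns on opposite channels may overlap; a parser should not reject such records.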

Speech Data

Transcription Conventions

The American Heritage Dictionary (AHD) [3] will serve as the standard reference for word spellings. Words that don't occur in the AHD will be spelled using the most common accepted spelling.

Hesitation sounds, referred to as "non-lexemes", will be represented with a leading "%" character. Although these sounds are transcribed in a variety of ways due to highly variable phonetic quality, they are all considered to be functionally equivalent from a linguistic perspective.

Training Data

The entire Switchboard-1 Corpus and the first 100 conversations of the Call_Home English training corpus may be used for training, as may the 40 Call_Home English conversations that were designated as the Spring 1996 and Spring 1997 development sets. The entire Switchboard-2 Phase-1 Corpus may also be used for training, except for the 20 conversations reserved for the development set (see below). These corpora are available from the LDC. Additional data may also be used for training, provided that the data are publicly available at the time of reporting results.

Development Data (the DevSet)

The March 1997 Call_Home English EvalSet (20 conversations) and Switchboard-2 Phase-1 EvalSet (20 conversations) will serve as the twin DevSets. Segment time marks and corresponding SNOR transcriptions for these data will be provided in standard STM [4] (segment time marked) format. The file names for these 40 conversations are as follows:

CALLHOME ENGLISH

   1. en_4042     6. en_4473    11. en_4775    16. en_4873
   2. en_4215     7. en_4495    12. en_4796    17. en_5153
   3. en_4334     8. en_4606    13. en_4812    18. en_5853
   4. en_4339     9. en_4637    14. en_4834    19. en_6103
   5. en_4368    10. en_4763    15. en_4859    20. en_6310

SWITCHBOARD-2 PHASE-1

   1. sw_10007    6. sw_11300   11. sw_12148   16. sw_13459
   2. sw_10022    7. sw_11356   12. sw_12353   17. sw_13476
   3. sw_10094    8. sw_11408   13. sw_13008   18. sw_13495
   4. sw_10661    9. sw_11632   14. sw_13082   19. sw_13651
   5. sw_11127   10. sw_11778   15. sw_13105   20. sw_13659



In addition, smaller subsets of each of these DevSets are defined for the purpose of facilitating exchange of research results between sites. These consist of the first 30 seconds of speech (from the 5-minute evaluation excerpt) for each conversation side of all conversations. The actual elapsed time will be at least 30 seconds, in order to capture 30 seconds of speech. (In some cases the elapsed time may reach the limit of 5 minutes if the speaker mostly listens and doesn't say much.) Segment time marks and corresponding transcriptions for these subsets will be provided in standard STM format.
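The selection of the first 30 seconds of speech for one conversation side can be sketched as follows. This is only an illustration: it assumes segments are given as (begin, end) pairs in seconds and that whole segments are kept until the accumulated speech reaches the target. Whether the final segment is truncated is not stated here, and the function name is hypothetical.

```python
def first_speech_segments(segments, target_seconds=30.0):
    """Return the leading segments whose speech time first reaches the target.

    segments: iterable of (begin, end) pairs, in seconds, for one side.
    Returns (selected_segments, total_speech_seconds).
    """
    selected = []
    total = 0.0
    for begin, end in sorted(segments):
        if total >= target_seconds:
            break
        selected.append((begin, end))
        total += end - begin   # only time inside segments counts as speech
    return selected, total
```

The elapsed time of the subset is the end time of the last selected segment minus the start of the excerpt, which is why it can be much longer than 30 seconds for a speaker who mostly listens.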

Evaluation Data (the EvalSet)

The EvalSet will comprise 20 conversations each from Call_Home English and Switchboard-2 Phase-2, for a total of 40 conversations. Whole conversations will be supplied, but recognition will be scored only for the 5-minute excerpt chosen by the LDC for transcription from each conversation. Speaker turn segmentation information for these 5-minute excerpts will be supplied to guide the recognition system. This segmentation information will be supplied in NIST's PEM file format.

While the Evaluation data will come from two different sources (namely Call_Home and Switchboard-2), the identity of the source is not to be provided to the system under test. The system must either recognize the speech irrespective of the source or must automatically determine the source from examination of the speech signal.

The Switchboard-2 conversations will be provided after the application of an echo cancellation algorithm. This algorithm is being used to maximize the separation of the two sides of each conversation. The echo cancellation software and documentation are available from Mississippi State's ISIP laboratory [5] via anonymous ftp.

Version 2.5 of the echo cancellation software will be used. A Perl wrapper script to process SPHERE-headered speech files will also be provided.

Echo cancellation will not be performed on the Call_Home English data. The reason is that the Call_Home data were collected from a telephone system that already had (nonlinear) echo suppression incorporated in it.

The Evaluation

Each system will be evaluated by measuring that system's word error rate (WER). Each system will also be evaluated in terms of its ability to predict recognition errors. System performance will be evaluated over an ensemble of conversations. These conversations will be chosen to represent a statistical sampling of conditions of evaluation interest. For the 1998 evaluation these conditions will include sex, geographical distribution and age.

The Reference Transcription

The reference transcriptions are intended to be as accurate as possible, but there will necessarily be some ambiguous cases and outright errors. In view of the existing high error rates of automatic recognizers on this type of data, it is not considered cost effective to generate multiple independent human transcriptions of the data or to have a formal adjudication procedure following the evaluation submissions.

The reference transcription for each turn will be limited to a single sequence of words. This word sequence will represent the transcriber's best judgment of what the speaker said.

Word fragments will be represented by an initial part of a word with a hyphen at the end. Correct recognition will consist of either ignoring the fragment, or producing a word of which the fragment is an initial part.
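The fragment rule can be expressed as a small predicate. This is a sketch only: the function name is hypothetical, comparison is assumed to be case-insensitive, and `None` stands for the case where the system output contains no word aligned to the fragment.

```python
from typing import Optional


def fragment_is_correct(fragment: str, hyp_word: Optional[str]) -> bool:
    """Apply the fragment rule to a reference fragment such as 'recog-'."""
    stem = fragment.rstrip("-").lower()
    if hyp_word is None:
        return True                          # the fragment was ignored
    return hyp_word.lower().startswith(stem)  # fragment is an initial part
```

For example, the reference fragment "recog-" is satisfied by no output at all or by "RECOGNITION", but not by "wreck".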

The reference transcription will contain no hyphenated words. Each hyphenated word will be split into its constituent words.

The WER Metric

Word error rate is defined as the number of words in error divided by the number of words in the reference transcription. The words in error are of three types, namely substitution errors, deletion errors, and insertion errors. Identification of these errors results from the process of mapping the words in the reference transcription onto the words in the system output transcription. This mapping is performed using NIST's SCLITE software package [6].


Scoring will be performed by aligning the system output transcription with the reference transcription and then computing the word error rate. Alignment will be performed independently for each turn, using NIST's SCLITE scoring software. The system output transcription will be processed to match the form of the reference transcription. Hyphenated words will be separated into their separate constituent words.
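The computation can be illustrated with a minimal dynamic-programming alignment. This sketch assumes unit costs for substitutions, deletions, and insertions; SCLITE's actual alignment weights and tie-breaking may differ, so the error-type breakdown can vary from official scores. The function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-error alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (errors, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)          # all reference words deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)          # all hypothesis words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)        # correct word
            else:
                diag = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


def word_error_rate(ref, hyp):
    """WER = (S + D + I) / number of reference words."""
    if not ref:
        raise ValueError("reference transcription is empty")
    subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)
```

Because insertions count as errors while the denominator is the reference length, the word error rate can exceed 100%.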

Some variant spellings of the same word exist in the transcriptions. These words, with or without hyphens, will be mapped onto a single preferred word spelling without hyphens. The set of all such mappings is:

Input      Output
mhm        uhhuh
mmhm       uhhuh
mm-hm      uhhuh
mm-huh     uhhuh
huh-uh     uhuh

For scoring purposes, all hesitation sounds will be considered to be equivalent. Thus all reference transcription words beginning with "%", the hesitation sound flag, along with the conventional set of hesitation sounds, will be mapped to "%hesitation".
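The text normalization applied before alignment can be sketched as below, using the variant-spelling mappings and the hesitation set defined for this evaluation. The order of operations is an assumption (variant mapping first, then hesitation mapping, then hyphen splitting), as are the lowercasing and the choice to leave word fragments untouched; the function name is hypothetical.

```python
VARIANT_MAP = {
    "mhm": "uhhuh",
    "mmhm": "uhhuh",
    "mm-hm": "uhhuh",
    "mm-huh": "uhhuh",
    "huh-uh": "uhuh",
}

HESITATIONS = {
    "uh", "um", "eh", "mm", "hm", "ah", "huh",
    "ha", "er", "oof", "hee", "ach", "eee", "ew",
}


def normalize(words):
    """Map a list of transcript words to their scoring form."""
    out = []
    for word in words:
        w = word.lower()
        if w in VARIANT_MAP:                      # preferred spelling, no hyphen
            out.append(VARIANT_MAP[w])
        elif w.startswith("%") or w in HESITATIONS:
            out.append("%hesitation")             # all hesitations are equivalent
        elif w.endswith("-"):                     # word fragment: kept as-is (assumed)
            out.append(w)
        else:                                     # split hyphenated words
            out.extend(part for part in w.split("-") if part)
    return out
```

Applying the same function to both the reference and the system output keeps the two sides in the same form, so that a hypothesized "uh" matches a reference "%um".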

The system output transcriptions should use any of the hesitation sounds (without "%") when a hesitation is hypothesized. The set of hesitation sounds for the current evaluation is defined to be:

"uh", "um", "eh", "mm", "hm", "ah", "huh", "ha", "er", "oof", "hee", "ach", "eee" and "ew".

Weighted Word Error Rate

An information-theoretic weighted word error rate that emphasizes less frequently occurring words will be computed along with the standard word error rate.

Named Entities

This evaluation will include a "black box" named entity component implemented using the GTE/BBN Technologies IdentiFinder (TM) program to add named entity tags to systems' output. This will then be scored using the MITRE 'mscore' and the NIST 'aldistsm' programs. The script and documentation for these are available from the URLs

ftp://jaguar.ncsl.nist.gov/lvcsr/sru_tools/ne_scr01.pl

ftp://jaguar.ncsl.nist.gov/lvcsr/sru_tools/ne_scr01.htm

NIST will determine the values of two metrics for evaluating the named entity results, namely the F-ratio and the slot error rate, a measure computed analogously to word error rate, as proposed by John Makhoul.
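As a rough illustration of the two metrics, the sketch below computes a balanced F-measure and a slot error rate from counts of correct, substituted, deleted, and inserted slots. It assumes the usual definitions (precision and recall over slots, and slot errors divided by reference slots) and ignores any partial-credit distinctions that the scoring programs may make; the function name is hypothetical.

```python
def named_entity_scores(correct, substituted, deleted, inserted):
    """Return (f_measure, slot_error_rate) from slot counts."""
    ref_slots = correct + substituted + deleted    # slots in the reference
    hyp_slots = correct + substituted + inserted   # slots in the system output
    if ref_slots == 0:
        raise ValueError("no reference slots")
    recall = correct / ref_slots
    precision = correct / hyp_slots if hyp_slots else 0.0
    if precision + recall == 0.0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    slot_error_rate = (substituted + deleted + inserted) / ref_slots
    return f_measure, slot_error_rate
```

Like word error rate, the slot error rate counts each substitution, deletion, and insertion once and can exceed 1.0 when there are many insertions.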

The Confidence Score

Along with each word output by a system, a confidence measure is also required. This confidence measure is the system's estimate of the probability that the word is correct. While this might be merely a constant probability, independent of the input, certain applications and operating conditions may derive significant benefit from a more informative estimate that is sensitive to the input signal. This benefit will be evaluated by computing the mutual information (cross entropy) between the correctness of the system's output word and the confidence measure output for it, normalized by maximum cross entropy:


[Equation: normalized cross entropy of the confidence scores; not reproduced in this copy.]
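A sketch of one common formulation of normalized cross entropy follows. Since the equation is not reproduced in this copy, the formula in the code is an assumption based on the description above: the log-likelihood of the correct/incorrect outcomes under the reported confidences is compared against the entropy obtained from a constant confidence equal to the overall fraction of correct words. The function name is hypothetical.

```python
import math


def normalized_cross_entropy(scored):
    """scored: iterable of (confidence, is_correct) pairs, 0 < confidence < 1."""
    scored = list(scored)
    n_total = len(scored)
    n_correct = sum(1 for _, ok in scored if ok)
    if n_correct == 0 or n_correct == n_total:
        raise ValueError("need both correct and incorrect words")
    p_c = n_correct / n_total
    # Entropy (in bits) of the outcomes under the constant confidence p_c.
    h_max = -(n_correct * math.log2(p_c)
              + (n_total - n_correct) * math.log2(1.0 - p_c))
    # Log-likelihood (in bits) of the outcomes under the reported confidences.
    log_lik = sum(math.log2(p) if ok else math.log2(1.0 - p)
                  for p, ok in scored)
    return (h_max + log_lik) / h_max
```

Under this formulation a constant confidence equal to the fraction of correct words scores 0, perfectly informative confidences approach 1, and poorly calibrated confidences can score below 0.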

Submission of Results

Results must be submitted to NIST by September 4, 1998 at 7:00 PM EST using the following steps:

    1. system output file creation,
    2. directory structure creation,
    3. system documentation, including execution times, and system output inclusion, and
    4. transmission protocol to NIST.


Step 1: System output file creation

The time-marked hypothesis words for each test will be placed in a single file, called "<TEST_SET>.ctm". The CTM (Conversation Time-Mark) file format is a concatenation of time marks for each word in each side of a conversation. Each word token must have a conversation ID, channel identifier [A | B], start time, duration, case-insensitive word text, and a confidence score. The start time must be in seconds and relative to the beginning of the waveform file. The conversation IDs for this evaluation will be of the form:

CONV_ID::= <SWB2_ID> | <CALLHOME_ID>

where,

SWB2_ID ::= sw_DDDDD (where DDDDD is a five digit conversation code)
CALLHOME_ID ::= en_DDDD (where DDDD is a four digit conversation code)

The file must be sorted by the contents of the first three columns: the first and the second in ASCII order, the third in numeric order. The UNIX sort command: "sort +0 -1 +1 -2 +2nb -3" will sort the words into appropriate order.

Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.

Included below is an example:

;;
;; Comments follow ';;'
;;
;; The Blank lines are ignored

;;
en_7654 A 11.34 0.2 YES -6.763
en_7654 A 12.00 0.34 YOU -12.384530
en_7654 A 13.30 0.5 CAN 2.806418
en_7654 A 17.50 0.2 AS 0.537922
:
en_7654 B 1.34 0.2 I -6.763
en_7654 B 2.00 0.34 CAN -12.384530
en_7654 B 3.40 0.5 ADD 2.806418
en_7654 B 7.00 0.2 AS 0.537922
:
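For sites without the UNIX sort command, the parsing, validation, and ordering described above can be sketched in Python. The sketch assumes whitespace-separated fields and complete records only (the ':' lines in the example stand for omitted words and are not valid records); the names `CtmWord`, `parse_ctm`, and `sort_ctm` are hypothetical.

```python
import re
from typing import NamedTuple

# CONV_ID ::= sw_DDDDD | en_DDDD
CONV_ID = re.compile(r"^(sw_\d{5}|en_\d{4})$")


class CtmWord(NamedTuple):
    conv_id: str
    channel: str
    start: float
    duration: float
    word: str
    confidence: float


def parse_ctm(text):
    """Parse CTM records, ignoring blank lines and ';;' comment lines."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields: {line!r}")
        conv_id, channel, start, duration, word, confidence = fields
        if not CONV_ID.match(conv_id):
            raise ValueError(f"bad conversation id: {conv_id!r}")
        if channel not in ("A", "B"):
            raise ValueError(f"bad channel: {channel!r}")
        words.append(CtmWord(conv_id, channel, float(start),
                             float(duration), word, float(confidence)))
    return words


def sort_ctm(words):
    """Order by conversation id and channel (string order), then start time."""
    return sorted(words, key=lambda w: (w.conv_id, w.channel, w.start))
```

Sorting on the parsed start time rather than on the text of the third column gives the numeric order the format requires, so "2.00" sorts before "11.34".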


Step 2: Directory Structure Creation

Create a directory identifying your site ('SITE'), named as in the SITE definition given below. This directory will serve as the root directory for all your submissions.

You should place all of your recognition test results in this directory. When scored results are sent back to you and subsequently published, this directory name will be used to identify your organization.

For each test system, create a sub-directory under your 'SITE' directory identifying the system's name or key attribute. The sub-directory name is to consist of a free-form system identification string 'SYSID' chosen by you. Place all files pertaining to the tests run using a particular system in the same SYSID directory.

The following is the BNF directory structure format for Hub-5 hypothesis recognition results:

<SITE>/<SYSID>/<FILES>

where

SITE ::= bbn | dragon | ibm | sri | . . .
SYSID ::= (short system description ID, preferably <= 8 characters)
FILES ::=
    sys-desc.txt      (system description, described below, including reference to paper if applicable)
    <TEST_SET>.ctm    (file containing time-marked hypothesis word strings created in Step 1)

where

TEST_SET ::= english
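The directory layout can be created with a few lines of Python. This is a sketch under the structure given above; the function name and its parameters are hypothetical, and the site and system names passed in are the caller's own.

```python
from pathlib import Path


def create_submission_tree(root, site, sysid, description, ctm_text,
                           test_set="english"):
    """Create <SITE>/<SYSID>/ under root and write the two required files."""
    sys_dir = Path(root) / site / sysid
    sys_dir.mkdir(parents=True, exist_ok=True)
    # System description, in the format given in Step 3.
    (sys_dir / "sys-desc.txt").write_text(description)
    # Time-marked hypothesis words created in Step 1.
    (sys_dir / f"{test_set}.ctm").write_text(ctm_text)
    return sys_dir
```

A site submitting several systems would call this once per SYSID, so that each system's description and output stay together in one sub-directory.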

Step 3: System Documentation, including execution times, and System Output Inclusion

For each test you run, a brief description of the system (the algorithms) used to produce the results must be submitted along with the results. (It is permissible for a single site to submit multiple systems for evaluation. In this case, however, the submitting site must identify one system as the "primary" system prior to performing the evaluation.)

The format for the system description is as follows:

SITE/SYSTEM NAME
TEST DESIGNATION

  1. Primary Test System Description:
  2. Acoustic Training:
  3. Grammar Training:
  4. Recognition Lexicon Description:
  5. Differences for each Contrastive Test: (if any contrastive tests were run.)
  6. New Conditions for This Evaluation:
  7. Execution Time:

    Sites must report the CPU execution time that was required to process the test data, as if the test were run on a single CPU. Sites must also describe the CPU and the amount of memory used.

  8. References:

Your system description file must be called "sys-desc.txt" and should be placed in the 'SYSID' sub-directory to which it pertains.

Likewise, the time-marked hypothesis file created in Step 1 must be called "<TEST_SET>.ctm" and should be placed in the 'SYSID' sub-directory to which it pertains. For this evaluation, the value for <TEST_SET> will be "english".

Step 4: Test Results Submission Protocol

Once you have structured all of your recognition results according to the above format, you can then submit them to NIST. Due to international e-mail file size restrictions, test sites are permitted to submit results to NIST using either email or anonymous ftp. Continental US sites may use either method, but international sites must use the 'ftp' method. The following instructions assume that you are using the UNIX operating system. If you do not have access to UNIX utilities or ftp, please contact NIST to make alternate arrangements.

E-mail method:

First change directory to the directory immediately above the <SITE> directory. Next, type the following:

tar -cvf - ./<SITE> | compress | uuencode <SITE>-<SUBM_ID>.tar.Z | \
mail -s "September 98 Hub-5 test results <SITE>-<SUBM_ID>" \
alvin.martin@nist.gov

where

<SITE>       is the name of the directory created in Step 2 to identify your site.
<SUBM_ID>    is the submission number (e.g. your first submission would be numbered '1', your second '2', etc.).

Ftp method:

First change directory to the directory immediately above the <SITE> directory. Next, type the following command.

tar -cvf - ./<SITE> | compress > <SITE>-<SUBM_ID>.tar.Z

where

<SITE>       is the name of the directory created in Step 2 to identify your site.
<SUBM_ID>    is the submission number (e.g. your first submission would be numbered '1', your second '2', etc.).

This command creates a single file containing all of your results. Next, ftp to jaguar.ncsl.nist.gov giving the username 'anonymous' and your e-mail address as the password. After you are logged in, issue the following set of commands, (the prompt will be 'ftp'):

You've now submitted your recognition results to NIST. The last thing you need to do is send an e-mail message to Alvin Martin at 'alvin.martin@nist.gov' notifying NIST of your submission. Please include the name of your submission file in the message.

Note:

If you choose to submit your results in multiple shipments, please submit ONLY one set of results for a given test system/condition unless you've made other arrangements with NIST. Otherwise, NIST will programmatically ignore duplicate files.

 

Schedule

DevSet Release         March 1997
Commitment Deadline    August 7, 1998
EvalSet Release        August 14, 1998
Results Deadline       September 4, 1998 at 7:00 PM EST
Results Release        September 9, 1998
Workshop               September 24-25, 1998
                       Maritime Institute of Technology and Graduate Studies
                       Linthicum, Maryland


Footnotes

  1. These turn time marks will be specified in seconds (to the nearest millisecond) and will completely encompass the turn. Thus alternate turns will overlap if the speakers talk over each other.

  2. The PEM ("partitioned evaluation map") file format is given in the SCLITE documentation available through NIST's web page (http://www.nist.gov/itl/div894/894.01/software.htm). Each record contains 5 fields: <filename>, <channel ("A" or "B")>, <speaker ("unknown")>, <begin time> and <end time>.

  3. The American Heritage Dictionary of the English Language, Book and CD ROM. Published October 1994 by Houghton Mifflin. ISBN 0395711460.

  4. STM stands for "segment time marked". The STM file identifies time intervals along with the transcription for those intervals. At the time this document was prepared, the STM file format was documented in NIST's SCLITE scoring software distribution, available via NIST's web page (http://www.nist.gov/itl/div894/894.01/software.htm).

  5. Mississippi State's echo cancellation software is available via ftp access through the following address: ftp://ftp.isip.msstate.edu/pub/resources/technology/software/1996/fir_echo_canceller/ec_v2.5.tar.gz

  6. SCLITE software is available via NIST's web page (http://www.nist.gov/itl/div894/894.01/software.htm).