NIST Speech Group Website
Information Technology Lab, Information Access Division, NIST: National Institute of Standards and Technology


  • RT-02 Software

    1.0 Introduction

    The RT-02 evaluation is the first in a series of new NIST-administered evaluations that combine Speech-To-Text (STT) and metadata (MD) annotation. RT-02 will make use of existing scoring tools to perform the evaluation. As such, the two defined tasks, Speech-To-Text and Speaker Segmentation and Identification, will still be evaluated independently.

    After RT-02, NIST will release new evaluation software that will have the flexibility to evaluate a wide variety of STT and metadata types that are combined into a single representation.

    2.0 Speech-To-Text Scoring Instructions

    Traditionally, NIST has referred to this as Automatic Speech Recognition (ASR), but we are adopting the term STT for the RT evaluation series. STT tasks will be evaluated similarly to past ASR tasks. The NIST scoring utilities, Tranfilt and SCTK, will be used to normalize the transcripts prior to scoring and the transcripts will be scored with the SCTK utility SCLITE.

    Past scoring conventions regarding overlapping speech, hesitations and other speech phenomena will be followed this year. Consult the RT-02 evaluation plan for specific details.

    2.1 Required Scoring Utilities

    Two software packages, two supporting scripts, the UTF DTD, and a global mapping file are needed to evaluate the output of an STT system. These resources are available via FTP from the following locations:

    • SCTK is the NIST Scoring Toolkit. It contains the sclite scoring engine.
    • The Tranfilt transcription filtering package is used by the hubscr script below to normalize the orthographies prior to scoring.
    • The scoring script, hubscr05.pl, uses the SCTK and Tranfilt packages to score an STT system output.
    • The Universal Transcript Format (UTF) transcription filter 'utf_filt' reads a UTF-encoded transcript to produce the STM-formatted reference files used by the SCTK package. UTF specifies the form of the reference transcript to be evaluated against the system output. If you don't have the SGMLS Perl package installed on your system, download the Perl module as well.
    • The UTF DTD describes the format for UTF documents.
    • en20010117_hub5.glm is the most recent global mapping file (GLM). GLMs are needed by the scoring scripts to normalize spelling variants and pre-format the system output.

    2.2 System Output Formatting

    The RT-02 STT system output format uses the same CTM format as used in previous Hub-4 and Hub-5 evaluations.

    The CTM file format is a concatenation of time mark records for each word in each channel of a waveform. The records are separated by newlines. Each word token must have a waveform ID, a channel identifier (matching the channels of the reference file), a start time, a duration, and the word text. Optionally, a confidence score can be appended for each word.

    Consult the CTM documentation on the NIST website for complete details.
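    As an illustration of the record layout described above, the sketch below parses hypothetical CTM lines. The field values and waveform ID are made up for the example; consult the CTM documentation for the authoritative definition.

    ```python
    # Minimal sketch of parsing CTM records: waveform ID, channel,
    # start time, duration, word, and an optional confidence score.
    # The sample data here is hypothetical.
    from typing import NamedTuple, Optional

    class CtmRecord(NamedTuple):
        waveform_id: str
        channel: str
        start_time: float
        duration: float
        word: str
        confidence: Optional[float]

    def parse_ctm_line(line: str) -> CtmRecord:
        fields = line.split()
        conf = float(fields[5]) if len(fields) > 5 else None
        return CtmRecord(fields[0], fields[1], float(fields[2]),
                         float(fields[3]), fields[4], conf)

    # One newline-separated record per word; the second record omits
    # the optional confidence field.
    ctm_text = """\
    sw2001 A 12.34 0.41 hello 0.97
    sw2001 A 12.80 0.25 world
    """
    records = [parse_ctm_line(l) for l in ctm_text.splitlines()]
    ```
    
    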

    2.3 Example Invocation

    A compressed tar archive containing a scoring example is available from the NIST FTP server. The package contains a README file showing how the utilities are used to score an STT system output.

    3.0 Metadata Annotation Scoring Instructions for Speaker Segmentation

    The speaker segmentation and identification metadata annotation task will be evaluated using the same procedures as defined for the N-Speaker Segmentation condition of the NIST 2001 Speaker Recognition Evaluation. As stated in Section 2.1.4 of the 2001 Speaker Recognition Evaluation Plan, the task is to identify the time intervals during which each of several unknown speakers is speaking in a recording. Because the speakers are unknown, there is no assumed pool of speakers to identify, and the task therefore does not require cross-recording speaker mapping.

    The evaluation will use the same system evaluation metric, Segmentation Error Rate, as defined in Section 3.3 of the 2001 Speaker Recognition Evaluation Plan.

    3.1 Required Scoring Utilities

    The speaker segmentation scoring utilities are available from the URL ftp://jaguar.ncsl.nist.gov/pub/seg_scr.v21.tar.Z. The package contains example input files and instructions for running the program.

    3.2 System Output Formatting

    The segmentation scoring software takes as input an index file of segmentation files to score. The index file is an ASCII file containing a list of records. Each newline-separated record identifies corresponding hypothesis and reference segmentation files to score. The index file is formatted as follows:

    HYP_FILE_NAME REF_FILE_NAME
    HYP_FILE_NAME REF_FILE_NAME
    ...
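    The index file above is simple enough to read with a few lines of code. The sketch below is an illustration only, with hypothetical file names; the scoring software itself consumes this file directly.

    ```python
    # Sketch: reading the newline-separated index file, where each
    # record names a hypothesis file and its reference file.
    # File names below are hypothetical.
    def parse_index(text):
        pairs = []
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            hyp, ref = line.split()
            pairs.append((hyp, ref))
        return pairs

    index_text = "sys1.hyp meeting1.ref\nsys2.hyp meeting2.ref\n"
    pairs = parse_index(index_text)
    ```
    
    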

    Both the system-generated segmentation files and reference segmentation files use the same format. The files are formatted as lists of segmentation records. Each record indicates the start and end times for a speaker, and the speaker id to which the interval is attributed. The file is formatted as follows:

    START_TIME END_TIME SPEAKER_ID
    START_TIME END_TIME SPEAKER_ID

    where:

    • START_TIME: The starting interval time (to the hundredth of a second).
    • END_TIME: The ending interval time (to the hundredth of a second).
    • SPEAKER_ID: The speaker cluster this segment belongs to. This can be any string.
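    To make the segmentation record layout concrete, the sketch below parses hypothetical segmentation records and totals the time attributed to each speaker cluster. This is an illustration of the file format only, not the official Segmentation Error Rate computation, which is performed by the scoring software.

    ```python
    # Sketch: parsing START_TIME END_TIME SPEAKER_ID records and
    # summing speech time per speaker cluster. Times and speaker IDs
    # here are hypothetical.
    from collections import defaultdict

    def parse_segments(text):
        segs = []
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            start, end, spkr = line.split()
            segs.append((float(start), float(end), spkr))
        return segs

    def time_per_speaker(segs):
        totals = defaultdict(float)
        for start, end, spkr in segs:
            totals[spkr] += end - start
        return dict(totals)

    ref_text = "0.00 5.25 spkr_1\n5.25 9.10 spkr_2\n"
    totals = time_per_speaker(parse_segments(ref_text))
    ```
    
    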

    3.3 Example Invocation

    The segmentation software distribution README includes example system and reference segmentation files, along with examples of how to run the scoring program.


    Page Created: September 17, 2007
    Last Updated: November 4, 2008

    Multimodal Information Group is part of IAD and ITL
    NIST is an agency of the U.S. Department of Commerce