NIST Speech Group Website
Information Technology Lab, Information Access Division NIST: National Institute of Standards and Technology

    Metadata Annotation Experiment

    1.0 Purpose

    An initial experiment was run in January 2002 to explore metadata annotation relevant to the Rich Transcription Evaluation and the DARPA EARS program. The results of this experiment were used by NIST in defining the metadata annotation portion of the RT-02 evaluation and will be used in planning for future RT evaluations. Since this is still an open area of research, we continue to welcome additional submissions.

    The experimenter may propose one or more metadata annotation types by describing the annotation type(s) and hand-annotating the proposed types on the specified experimental data (see below). Proposed annotation types should satisfy the following criteria:

    • be relevant to producing human-usable, automatically generated transcriptions of speech;
    • be automatically producible, with some level of accuracy, by near-future systems; and
    • be consistently and efficiently hand-annotatable, so that a gold standard can be created for evaluation.

    2.0 Experiment Results

    Researchers were encouraged to work with the sample source data (audio files and associated transcriptions) to create a set of metadata annotation definitions and sample annotations which they believed were of interest for the RT metadata task. Several research sites submitted a variety of suggested types. The raw results of the experiment are given in this metadata annotation experiment results summary page. Using these suggestions, NIST created a putative set of initial metadata annotation types (speaker change/ID, acronym, verbal edit interval, named entity/type, numeric expression/type, and temporal expression/type) which we believed could be implemented this year. (Note that an earlier internal experiment had demonstrated that annotating sentences or punctuation in spontaneous interactive speech was a very difficult task requiring further study, so we chose to defer exploration of those types.) However, given the very tight schedule, we decided to focus only on the detection of speaker changes and clustering within each excerpt for this first evaluation. This would permit us to develop and implement an infrastructure for metadata annotation evaluation while continuing to study and discuss other metadata types of interest for implementation in future evaluations.

    3.0 Experimental Data

    The data for this experiment consists of three short excerpts of digitally recorded speech from each of three source types (news broadcasts, telephone conversations, and meetings), for a total of nine excerpts of about 20 minutes total duration. Since orthographic transcription is not the focus of this exercise, the orthography has been provided. Please annotate around the given orthography (even if you disagree with it). Since speaker segmentation is an obvious metadata type of interest and is relatively non-controversial to annotate, we are pre-annotating the experimental data with this information in the prescribed format to provide a concrete example of the type submissions we expect. The audio data and transcripts are available at the metadata annotation experiment data samples page.

    4.0 Submission Formats

    You must both submit a data type definition for each proposed metadata type and fully annotate all of the provided experimental data with that type.

    4.1 Data Type Definition

    Complete the following template for each proposed metadata type:

    • Type name:
    • Type attributes and allowable attribute values:
    • Description:
    • Justification (why this type is pertinent to the above criteria):
    • Rules for annotation:
    • Other Notes:


    • Type name:
      • speaker
    • Type attributes and allowable attribute values:
      • "id" (required): Values: spkr_(1... n)
      • "starttime" (required): Floating point number of seconds
      • "endtime" (required): Floating point number of seconds
    • Description:
      • This metadata type is used to identify when speaker changes occur in the audio stream, who each identified speaker is relative to that session, and when the speakers begin and finish speaking.
    • Justification:
      • This attribute will permit the labeling of speakers and turns in automatically-produced transcriptions. It is obviously important to know who said what in multi-speaker recordings, and that will facilitate the use of this information in downstream processing using speaker-specific models, etc.
    • Rules for annotation:
      • A <speaker> tag is inserted whenever a change of speaker is detected in the audio stream. A </speaker> tag is inserted when the speaker finishes the turn; if the speaker has a lengthy pause interrupted by another speaker, the turn is closed and a new <speaker> tag is generated for the remainder of the turn. The speaker is identified by the "id" attribute, which is assigned sequentially, spkr_(1... n), as new speakers are noted in the recording.
    • Other notes:
      • None
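    As an illustration, the speaker type definition above can be expressed and checked programmatically. The following is a minimal sketch (the language choice, class name, and `validate` helper are ours, not part of the submission format) that models one speaker segment and verifies its attributes against the rules above:

```python
import re
from dataclasses import dataclass

# Illustrative record for the proposed "speaker" metadata type.
# Field names mirror the required attributes in the type definition.
@dataclass
class SpeakerSegment:
    id: str            # required; of the form spkr_1 ... spkr_n
    starttime: float   # required; floating point number of seconds
    endtime: float     # required; floating point number of seconds

def validate(seg: SpeakerSegment) -> bool:
    """Return True if the segment satisfies the attribute rules."""
    if re.fullmatch(r"spkr_\d+", seg.id) is None:
        return False
    return 0.0 <= seg.starttime < seg.endtime

# Example: the first segment from the sample annotation in section 4.2.
seg = SpeakerSegment(id="spkr_1", starttime=0.36, endtime=23.10)
```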

    4.2 Experimental Data Format

    Note that although time will likely be the primary unit for metadata annotation for RT-2002, for this experiment please provide inline tagging of your proposed metadata type in the orthography we have provided. Ignoring time is likely to save both NIST and you a considerable amount of effort.

    When doing your annotation, start with the speaker tagged data provided in the examples directory and add your tags inline with the text. If you propose multiple annotation types, you may either submit one version of each experimental data file per annotation type, or put multiple annotation types in each experimental data file.

    The marked-up files need not adhere to the SGML requirement for tag nesting and may contain overlapping tags. Please name your files clearly, and in the email you send us, document your organization with a list of the files you are submitting and what each contains.

    Example: (Note that we have included times in this example. Your annotations need not include times, just proper tag positioning relative to the orthography)

    <speaker id='spkr_1' starttime='0.36' endtime='23.10' > tonight this thursday big pressure on the clinton administration to do something about the latest killing in yugoslavia airline passengers and outrageous behavior at thirty thousand feet what can an airline do and now that el nino is virtually gone there is la nina to worry about one hot and one cold we'll take a closer look </speaker>

    <speaker id='spkr_2' starttime='28.65' endtime='35.88'> from a b c news world headquarters in new york this is world news tonight with peter jennings </speaker>

    <speaker id='spkr_1' starttime='35.88' endtime='71.74'> good evening we begin this evening with american threats to get more deeply involved against the serbs breath in what is left of yugoslavia emphasis on threats tonight the clinton administration and its allies in the north atlantic treaty organization are trying to stop the serb campaign of violence in the southern province of kosovo without having to get themselves too deeply involved it is very tricky the serbs are trying to crush an independence movement in kosovo and the serbian leader is not easily restrained given his promises in the past it's also a dilemma for mr. clinton first here's a b c's john mcwethy </speaker>

    <speaker id='spkr_3' starttime='71.74' endtime='91.85'> with villages in kosovo going up in smoke nato defense ministers in brussels issued the toughest warning yet to yugoslav president slobodan milosevic stop the killing of ethnic albanians who live in kosovo withdraw your heavy forces and begin peace talks or face the possibility of military action by the west </speaker>

    <speaker id='spkr_4' starttime='91.85' endtime='99.00'> today we took some important steps to ensure that mr. milosevic knows that his indiscriminate use of force is unacceptable. </speaker>
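    Because the marked-up files need not be well-formed SGML, a strict XML parser may reject them; a plain pattern-matching pass is one simple way to pull speaker segments out of annotations like the example above. A hedged sketch, assuming the attribute order shown in the example:

```python
import re

# Sketch: extract (id, starttime, endtime, text) tuples from an
# inline-annotated transcript. Pattern-matching tolerates overlapping
# tags elsewhere in the file; attribute order is assumed to follow
# the example annotations above.
SPEAKER_RE = re.compile(
    r"<speaker id='(?P<id>spkr_\d+)' "
    r"starttime='(?P<start>[\d.]+)' "
    r"endtime='(?P<end>[\d.]+)'\s*>"
    r"(?P<text>.*?)</speaker>",
    re.DOTALL,
)

def speaker_segments(annotated: str):
    """Yield (id, starttime, endtime, text) for each speaker span."""
    for m in SPEAKER_RE.finditer(annotated):
        yield (m["id"], float(m["start"]), float(m["end"]), m["text"].strip())

# Example input: the second segment from the sample annotation above.
sample = (
    "<speaker id='spkr_2' starttime='28.65' endtime='35.88'> from a b c "
    "news world headquarters in new york this is world news tonight "
    "with peter jennings </speaker>"
)
segments = list(speaker_segments(sample))
```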



    Page Created: September 18, 2007
    Last Updated: November 4, 2008
