Segmentation Detection Table (SDT) Specification

Updated June 30th, 2000

Background

This document contains a specification for the format and content of Segmentation Detection Table (.sdt) files to be used for the exchange of automatically extracted non-lexical segmentation information. This format is being used for experiments in the 2000 TREC-9 Spoken Document Retrieval track.

This format is a proposal for the common exchange of non-lexical information to support optional experiments in the SDR Track. Our hypothesis is that non-lexical information that can be automatically extracted from the audio signal may be helpful in performing segmentation for retrieval in the unknown story boundaries condition. Currently, most sites use the words output by automatic speech recognizers, together with pause information, as input to their retrieval systems, but do not use other possibly useful information such as speaker changes, noise changes, music, commercials, etc.

As such, we are providing a standard, named ".sdt" (Segmentation Detection Table) for the exchange of this type of data so that sites can explore the utility of incorporating it into their search system. Sites with non-lexical extraction abilities are encouraged to share their data with other retrieval sites in this format.

Sites wishing to perform experiments using these data will be required to execute runs without such side information as a control.

The proposed representation is now interval-based (i.e. attributes are recorded as they change in the signal). All interval types must be assumed to be independent (i.e. a particular attribute is assumed not to influence any other attributes recorded in this file). Therefore, systems can record multiple interval types in the same file.
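
Because interval types are independent, intervals of different types may overlap in time within one file. As a minimal sketch of what that implies for a consumer, the types active at a given instant could be looked up as shown here. The tuple layout, the function name, the sample values, and the half-open interval convention (start <= t < end) are all assumptions made for illustration and are not part of this specification.

```python
from typing import List, Tuple

# (type, start_seconds, end_seconds); hypothetical sample intervals
Interval = Tuple[str, float, float]


def active_types_at(intervals: List[Interval], t: float) -> List[str]:
    """Return the types of all intervals covering time t (start <= t < end)."""
    return [kind for kind, start, end in intervals if start <= t < end]


intervals = [
    ("speaker", 10.00, 20.00),
    ("music", 15.00, 30.00),      # overlaps the speaker interval
    ("bandwidth", 0.00, 12.50),
]
print(active_types_at(intervals, 16.0))
```

Each type is matched on its own, so adding a new interval type to the file does not change the result for the existing ones.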

SDT File Format

The following specifies the format of the information to be included in an .sdt file. Information is entered in rows terminated by <CR><LF>. Each row contains one SDT record (no <CR> or <LF> is permitted within a row). An SDT record contains a required section followed by an optional section. Comment lines are allowed in the file provided that a # sign is placed at the beginning of the line. In addition, empty lines (<CR><LF>) may be inserted as spacing between rows.
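
The row rules above can be sketched as a small reader that drops comment lines and empty spacer lines before any record parsing takes place. The function name and the comment text in the sample are hypothetical. Tolerating whitespace before the # sign is an assumption, not something the specification states.

```python
from typing import Iterable, Iterator


def iter_sdt_rows(lines: Iterable[str]) -> Iterator[str]:
    """Yield SDT record rows, skipping comment lines and empty spacer lines."""
    for raw in lines:
        row = raw.rstrip("\r\n")            # rows are terminated by <CR><LF>
        if not row.strip():                 # empty spacer line between rows
            continue
        if row.lstrip().startswith("#"):    # comment line (leading blanks tolerated)
            continue
        yield row


sample = (
    "# hypothetical comment line\r\n"
    "\r\n"
    "199980630_2130_2200_CNN_HDL commercial 231.79 240.24 0.876\r\n"
)
rows = list(iter_sdt_rows(sample.splitlines(keepends=True)))
print(rows)
```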

Required Section:

The required section of an SDT record contains 5 <WhiteSpace>-separated fields in the specified order. ALL of these fields must be included in EACH SDT record. Since these fields are separated by whitespace, they may not contain whitespace and must be of the form:

<FILE_ID><WS><TYPE><WS><S_TIME><WS><E_TIME><WS><CONFIDENCE>

Where,
FILE_ID specifies the name of the audio file annotated by this record


TYPE specifies the type of the event chosen from the published list. The preliminary list developed by NIST is given below. If sites wish to use additional types not listed below, they must register their type with NIST. This will ensure that different sites who are detecting the same event types are using the same nomenclature.

S_TIME specifies the time at which the interval begins, measured in seconds from the beginning of the signal with a precision of a hundredth of a second.

E_TIME specifies the time at which the same interval ends, measured in seconds from the beginning of the signal with a precision of a hundredth of a second.

CONFIDENCE specifies the confidence level provided by the system as a normalized value (floating point number between 0 and 1).
Examples: 0.6, 0.12, 0.483

If other features of an SDT Record are to be annotated, they should be recorded as data in the Optional Section (see below).
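
As an illustration of the required section, the following sketch splits one row into its five fields and checks the stated confidence range. The class and function names are hypothetical. The check that the end time is not earlier than the start time is an added assumption, and the optional section is kept as an unparsed string.

```python
from dataclasses import dataclass


@dataclass
class SdtRecord:
    file_id: str
    type: str
    s_time: float        # seconds from the beginning of the signal
    e_time: float        # seconds from the beginning of the signal
    confidence: float    # between 0 and 1
    optional: str = ""   # raw optional section, if present


def parse_required(row: str) -> SdtRecord:
    """Parse the five whitespace-separated required fields of one SDT row."""
    parts = row.split(None, 5)   # five fields plus the remainder of the row
    if len(parts) < 5:
        raise ValueError("an SDT record needs 5 required fields")
    file_id, type_, s, e, c = parts[:5]
    s_time, e_time, confidence = float(s), float(e), float(c)
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("CONFIDENCE must lie between 0 and 1")
    if e_time < s_time:          # assumption: not stated in the specification
        raise ValueError("E_TIME is earlier than S_TIME")
    rest = parts[5].strip() if len(parts) > 5 else ""
    return SdtRecord(file_id, type_, s_time, e_time, confidence, rest)


rec = parse_required(
    '199980630_2130_2200_CNN_HDL speaker 225.36 227.43 0.743 spk_id="sid_1"'
)
print(rec)
```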

Optional Section:

An SDT record may also contain optional information specifying features of the detected event. The optional information section is separated from the required information by a <WS>. The optional information is to be presented in attribute/value pair combinations separated by a <WS> and enclosed in one set of square brackets of the form:

OPTSECTION ::= <OPTFIELD><WS><OPTFIELD><WS>...<OPTFIELD>

where,
OPTFIELD ::= <ATTRIBUTE>="<VALUE>"
where,
ATTRIBUTE is a character descriptor for the annotated feature (comprised of printable alphanumeric characters and underscores)


VALUE is the quantity, ordinal or descriptor given to the attribute and may contain any character except a double quote (")
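
Because a VALUE may contain any character except a double quote, including whitespace, the optional section cannot be split on whitespace alone. The sketch below scans it with a regular expression instead. The function name and the "note" attribute in the usage line are hypothetical. Accepting the section both with and without enclosing square brackets, and not insisting on whitespace between fields, are assumptions made to keep the sketch lenient.

```python
import re
from typing import Dict

# ATTRIBUTE: alphanumerics and underscores; VALUE: anything except a double quote
OPTFIELD_RE = re.compile(r'([A-Za-z0-9_]+)="([^"]*)"')


def parse_optional(section: str) -> Dict[str, str]:
    """Parse ATTRIBUTE="VALUE" pairs from the optional section of an SDT row."""
    section = section.strip()
    # Assumption: tolerate one enclosing pair of square brackets.
    if section.startswith("[") and section.endswith("]"):
        section = section[1:-1].strip()
    attrs: Dict[str, str] = {}
    pos = 0
    while pos < len(section):
        match = OPTFIELD_RE.match(section, pos)
        if match is None:
            raise ValueError(f"malformed optional field at offset {pos}")
        attrs[match.group(1)] = match.group(2)
        pos = match.end()
        # skip the whitespace that separates consecutive fields
        while pos < len(section) and section[pos].isspace():
            pos += 1
    return attrs


print(parse_optional('spk_id="sid_1" note="two words"'))
```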


Examples:

File_ID Type S_Time E_Time Confidence Optional_attributes
199980630_2130_2200_CNN_HDL svolume 129.56 132.48 0.638 level="10"
199980630_2130_2200_CNN_HDL bandwidth 132.48 225.36 0.46 type="narrow"
199980630_2130_2200_CNN_HDL speaker 225.36 227.43 0.743 spk_id="sid_1"
199980630_2130_2200_CNN_HDL language 227.43 231.26 0.682 iso639="en-US"
199980630_2130_2200_CNN_HDL commercial 231.79 240.24 0.876
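
For sites producing .sdt files, a row like the examples above could be written with a helper along the following lines. This is a sketch only: the function name is hypothetical, separating fields with a single space is one possible choice of <WS>, and printing the confidence with the "g" format is an assumption, since the specification fixes the precision of the times but not of the confidence.

```python
from typing import Dict, Optional


def format_sdt_row(file_id: str, type_: str, s_time: float, e_time: float,
                   confidence: float,
                   attrs: Optional[Dict[str, str]] = None) -> str:
    """Build one SDT row, terminated by <CR><LF>."""
    if any(ch.isspace() for ch in file_id + type_):
        raise ValueError("required fields may not contain whitespace")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("CONFIDENCE must lie between 0 and 1")
    fields = [
        file_id,
        type_,
        f"{s_time:.2f}",       # hundredth-of-a-second precision
        f"{e_time:.2f}",
        f"{confidence:g}",     # assumption: compact float formatting
    ]
    for name, value in (attrs or {}).items():
        if '"' in value:
            raise ValueError("VALUE may not contain a double quote")
        fields.append(f'{name}="{value}"')
    return " ".join(fields) + "\r\n"


row = format_sdt_row("199980630_2130_2200_CNN_HDL", "bandwidth",
                     132.48, 225.36, 0.46, {"type": "narrow"})
print(repr(row))
```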



Currently Defined Interval Types

The following lists some possible interval types (as suggested by the community) with their descriptions and optional attributes. We invite sites to peruse this list, make suggestions regarding its content, and suggest new interval types. If you would like to broaden discussion, please cc your email to sdr_list@nist.gov. If you would like to suggest a new interval type, please send email using the following template to jerome.lard@nist.gov.

SDT Interval Template:

Interval type: <INTERVAL-NAME>

ID: <INTERVAL-ID>  (short alphanumeric identifier to be used in SDT)

Description: 
(short phrase describing detected non-lexical information and rules for it)
Attributes: (List of attributes and rules for generating them)
        Name: (short alphanumeric identifier)
        Format: (string, integer, or float)
        Values: (range or list in the format defined)
        Rules: (how to characterize the attribute)

Interval type: speaker 
ID: speaker
Description: indicate the characteristics of the main speaker

Attributes: 
        Name: spk_id
        Format: string
        Values: character stream 
                ? for unknown    

        Rules: 
        The "spk_id" is a character stream
        identifying the speaker.

Interval type: gender 
ID: gender
Description: indicate gender 

Attributes: 
        Name: gender
        Format: string
        Values: M if Male
                F if Female
                ? if unknown gender
        Rules: None

Interval type: story boundary marker 
ID: story
Description: story boundaries as hand-annotated in TDT/SDR per the TDT story segmentation
             specification

Attributes: None


Interval type: topic boundary marker 
ID: topic
Description: topics are topically cohesive excerpts that may or may not be equivalent to stories.
             Participants should state their own definition of the "topic" type.

Attributes: None

Interval type: bandwidth
ID: bandwidth
Description: indicate the channel bandwidth

Attributes: 

        Name: type
        Format: string
        Values: narrow, wide
        Rules: 
        The "bandwidth" interval is to be present when 
        a significant degradation or an enhancement of the 
        bandwidth of the channel is detected.

        As a consequence, "type" tries to classify its quality
        during broadcast and recording. 
        "narrow" for phone-bandwidth (band-limited).
        "wide" for studio-bandwidth (non-band-limited).

Interval type: speech volume
ID: svolume 
Description: level of the audio signal of the primary speaker

Attributes:
        Name: value
        Format: float
        Values: 0 to 9
        Rules: 
        The "value" characterizes the volume of the speech.
        0 for silence, 9 for very loud

Interval type: energy
ID: energy
Description: coarse categorisation of the energy in the signal

Attributes: 
        Name: level
        Format: string
        Values: 0 to 9
        Rules:  
        The exact rule defining each "level" used
        (e.g. duration, dB level and reference)
        must be specified as a comment at the beginning of the file.

Interval type: background speech
ID: bspeech
Description: background speech which is superimposed on the primary speech.

Attributes:
        Name: value
        Format: float
        Values: 0 to 9
        Rules: 
        The "background speech" interval is produced as soon as
        a significant increase in the power level of the background
        speech can be detected.
        As such, the "value" describes the level of the background SPEECH 
        present in the audio signal.  

        Participants may decide how the levels they detect will fit
        within the provided scale.
        0 for no background speech, 9 for loud background speech.
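
Since participants decide how detected levels fit within the 0 to 9 scale, one possible mapping is a clamped linear rescaling of a measured level. The sketch below is purely illustrative: the function name, the use of decibels, and the default floor and ceiling of -60 dB and 0 dB are assumptions and are not prescribed by this specification.

```python
def to_scale(level_db: float, floor_db: float = -60.0,
             ceil_db: float = 0.0) -> float:
    """Map a measured level in dB linearly onto the 0 to 9 scale, clamped."""
    if ceil_db <= floor_db:
        raise ValueError("ceil_db must be greater than floor_db")
    fraction = (level_db - floor_db) / (ceil_db - floor_db)
    fraction = min(1.0, max(0.0, fraction))   # clamp out-of-range measurements
    return round(9.0 * fraction, 1)


for db in (-60.0, -30.0, 0.0, 10.0):
    print(db, to_scale(db))
```

Whatever mapping a site chooses, documenting it in a comment line at the top of the file lets other sites interpret the values.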

Interval type: background noise
ID: bnoise
Description: background noise, which can include both music and any generic
             "noise" intervals, such as applause, helicopter noise, noise
             from machinery, noise from animals, etc.
 
Attributes:
        Name: value
        Format: float
        Values: 0 to 9
        Rules: 
        "value" is representative of the level of the background NOISE 
        present in the audio signal.  
        0 for no background noise, 9 for loud background noise.

Interval type: no-speech
ID: nospeech
Description: identifies areas of the audio which do not contain any
             speech at all.

Attributes: None

Interval type: silence
ID: silence
Description: no significant energy present in the signal.
             no speech AND no background noise in the signal.

Attributes: None

                        

Interval type: music 
ID: music
Description: presence of music

Attributes:
        Name: value
        Format: float
        Values: 0 to 9
        Rules: 
        The "value" characterizes the music level in the audio signal.
        Participants may decide how the levels they detect will fit
        within the provided scale.

        0 for no music, 9 for loud music.

Interval type: language
ID: language
Description: language the main speaker is using following ISO-639 (+ ISO-3166)
             two-letter codes

Attributes:
        Name: type
        Format: string
        Values: "en-GB" for British English
                "en-US" for American English
                "foreign" for any other language

        Rules:
        The following links provide more information about language types and standards:
                http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt
                http://www.isi.edu/in-notes/iana/assignments/country-codes

Interval type: sentence boundary
ID: sentence
Description: Location of end/start of a sentence

Attributes: None

Interval type: repetition
ID: repeat
Description: identical (exact match) words, speaker, background condition,
             music, etc. to previous audio.

Attributes: None

Interval type: commercial
ID: commercial
Description: commercial advertisement

Attributes: None
