This document contains a proposal for the common exchange of non-lexical information to support optional experiments in the SDR Track. Our hypothesis is that non-lexical information that can be automatically extracted from the audio signal may be helpful in performing segmentation for retrieval in the unknown story boundaries condition. Currently, most sites use the words output by automatic speech recognizers, together with pause information, as input to their retrieval systems, but not other possibly useful information such as speaker changes, noise changes, music, commercials, etc.
As such, we are providing a standard format, named ".sdt" (Segmentation Detection Table), for the exchange of this type of data so that sites can explore the utility of incorporating it into their search systems. Sites with non-lexical extraction abilities are encouraged to share their data with other retrieval sites in this format.
Sites wishing to perform experiments using these data will be required to execute runs without such side information as a control.
The proposed representation is now interval-based (i.e. attributes are recorded as they change in the signal). All interval types must be assumed to be independent (i.e. a particular attribute is assumed not to influence any other attributes recorded in this file). Therefore, systems can record multiple interval types in the same file.
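As a minimal sketch of what independent interval types in one file can look like in memory, the Python below groups intervals by type and asks which types cover a given time. The sample records, the function names, and the half-open convention (start <= t < end) are illustrative assumptions, not part of the proposal.

```python
from collections import defaultdict

# Hypothetical in-memory records: (type, start_seconds, end_seconds).
records = [
    ("speaker", 0.00, 12.50),
    ("music", 3.20, 8.75),
    ("speaker", 12.50, 30.10),
    ("bandwidth", 0.00, 30.10),
]

def group_by_type(records):
    """Group intervals by type; each type is then handled on its own."""
    by_type = defaultdict(list)
    for interval_type, start, end in records:
        by_type[interval_type].append((start, end))
    for spans in by_type.values():
        spans.sort()
    return dict(by_type)

def active_types(by_type, t):
    """Return the sorted names of interval types covering time t.

    Assumption: intervals are treated as half-open, start <= t < end.
    """
    return sorted(
        name
        for name, spans in by_type.items()
        if any(start <= t < end for start, end in spans)
    )

by_type = group_by_type(records)
print(active_types(by_type, 5.0))   # ['bandwidth', 'music', 'speaker']
print(active_types(by_type, 12.5))  # ['bandwidth', 'speaker']
```

Because no type is assumed to influence another, overlapping intervals of different types need no reconciliation; a retrieval system can simply consult whichever types it finds useful.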
<FILE_ID><WS><TYPE><WS><S_TIME><WS><E_TIME><WS><CONFIDENCE>
TYPE specifies the type of the event chosen from the published list.
The preliminary list developed by NIST is given below. If sites wish
to use additional types not listed below, they must register their type
with NIST. This will ensure that different sites detecting
the same event types use the same nomenclature.
S_TIME specifies the time at which the interval begins, measured in seconds from the beginning of the signal with a precision of a hundredth of a second.
E_TIME specifies the time at which the same interval ends, measured in seconds from the beginning of the signal with a precision of a hundredth of a second.
CONFIDENCE specifies the confidence level provided by the system as
a normalized value (a floating-point number between 0 and 1).
Examples: 0.6, 0.12, 0.483
OPTSECTION ::= <OPTFIELD><WS><OPTFIELD><WS>....<OPTFIELD>
VALUE is the quantity, ordinal or descriptor given to the attribute
and may contain any character except a double quote (")
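The field descriptions above could be read by a parser along the following lines. This is only a sketch: the function name `parse_sdt_line` is hypothetical, and it assumes that each optional field is written as name="value" and that the optional section follows the confidence field, as the examples in this document suggest.

```python
import re

# Assumption: an optional field has the form name="value", where the value
# may contain any character except a double quote.
OPT_FIELD = re.compile(r'(\w+)="([^"]*)"')

def parse_sdt_line(line):
    """Parse one whitespace-separated SDT record into a dict."""
    # Split off the five mandatory fields; the remainder (if any) is kept
    # whole so that quoted values containing whitespace survive.
    parts = line.split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(parts)}")
    file_id, interval_type, s_time, e_time, confidence = parts[:5]
    record = {
        "file_id": file_id,
        "type": interval_type,
        "s_time": float(s_time),   # seconds from the start of the signal
        "e_time": float(e_time),
        "confidence": float(confidence),
        "attributes": {},
    }
    if record["e_time"] < record["s_time"]:
        raise ValueError("interval ends before it begins")
    if not 0.0 <= record["confidence"] <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    if len(parts) == 6:
        record["attributes"] = dict(OPT_FIELD.findall(parts[5]))
    return record

example = '199980630_2130_2200_CNN_HDL speaker 225.36 227.43 0.743 spk_id="sid_1"'
print(parse_sdt_line(example))
```

The range checks on the times and the confidence are a design choice of this sketch; a more lenient reader might only warn on such records rather than reject them.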
Examples:
| File_ID | Type | S_Time | E_Time | Confidence | Optional_attributes |
| --- | --- | --- | --- | --- | --- |
| 199980630_2130_2200_CNN_HDL | svolume | 129.56 | 132.48 | 0.638 | level="10" |
| 199980630_2130_2200_CNN_HDL | bandwidth | 132.48 | 225.36 | 0.46 | type="narrow" |
| 199980630_2130_2200_CNN_HDL | speaker | 225.36 | 227.43 | 0.743 | spk_id="sid_1" |
| 199980630_2130_2200_CNN_HDL | language | 227.43 | 231.26 | 0.682 | iso639="en-US" |
| 199980630_2130_2200_CNN_HDL | commercial | 231.79 | 240.24 | 0.876 | |
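To show how a row of the table above might be written out as an actual record, here is a small serialization sketch. The function name `format_sdt_line`, the use of a single space as the separator, and the name="value" form for optional attributes are assumptions made for illustration.

```python
def format_sdt_line(file_id, interval_type, s_time, e_time, confidence,
                    attributes=None):
    """Build one SDT record, with times rounded to a hundredth of a second."""
    fields = [
        file_id,
        interval_type,
        f"{s_time:.2f}",
        f"{e_time:.2f}",
        f"{confidence:g}",
    ]
    for name, value in (attributes or {}).items():
        text = str(value)
        # A value may contain any character except a double quote.
        if '"' in text:
            raise ValueError(f"attribute {name!r} contains a double quote")
        fields.append(f'{name}="{text}"')
    # Assumption: a single space is an acceptable field separator.
    return " ".join(fields)

print(format_sdt_line("199980630_2130_2200_CNN_HDL", "speaker",
                      225.36, 227.43, 0.743, {"spk_id": "sid_1"}))
print(format_sdt_line("199980630_2130_2200_CNN_HDL", "commercial",
                      231.79, 240.24, 0.876))
```

Rejecting a double quote at write time keeps every emitted value readable by a consumer that splits on the quote character.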
SDT Interval Template:

Interval type: <INTERVAL-NAME>
ID: <INTERVAL-ID> (short alphanumeric identifier to be used in SDT)
Description: (short phrase describing detected non-lexical information and rules for it)
Attributes: (List of attributes and rules for generating them)
    Name: (short alphanumeric identifier)
    Format: (string, integer, or float)
    Values: (range or list in the format defined)
    Rules: (how to characterize the attribute)

Interval type: speaker
ID: speaker
Description: indicates the characteristics of the main speaker
Attributes:
    Name: spk_id
    Format: string
    Values: character stream; ? for unknown
    Rules: The "spk_id" is a character stream identifying the speaker.

Interval type: gender
ID: gender
Description: indicates gender
Attributes:
    Name: gender
    Format: string
    Values: M if male
            F if female
            ? if unknown gender
    Rules: None

Interval type: story boundary marker
ID: story
Description: story boundaries as hand-annotated in TDT/SDR per the TDT story segmentation specification
Attributes: None

Interval type: topic boundary marker
ID: topic
Description: topics are topically cohesive excerpts that may or may not be equivalent to stories. Participants should state their own definition of the "topic" type.
Attributes: None

Interval type: bandwidth
ID: bandwidth
Description: indicates the channel bandwidth
Attributes:
    Name: type
    Format: string
    Values: narrow, wide
    Rules: The "bandwidth" interval is to be present when a significant degradation or enhancement of the bandwidth of the channel is detected. As a consequence, "type" tries to classify its quality during broadcast and recording.
           "narrow" for phone bandwidth (band-limited).
           "wide" for studio bandwidth (non-band-limited).

Interval type: speech volume
ID: svolume
Description: level of the audio signal of the primary speaker
Attributes:
    Name: value
    Format: float
    Values: 0 to 9
    Rules: The "value" characterizes the volume of the speech. 0 for silence, 9 for very loud.

Interval type: energy
ID: energy
Description: coarse categorisation of the energy in the signal
Attributes:
    Name: level
    Format: string
    Values: 0 to 9
    Rules: The exact rule for what a "level" consists of (e.g. duration, dB level and reference), for any "level" used, must be specified as a comment at the beginning of the file.

Interval type: background speech
ID: bspeech
Description: background speech that is superimposed on the primary speech.
Attributes:
    Name: value
    Format: float
    Values: 0 to 9
    Rules: The "background speech" interval is produced as soon as a significant increase in the power level of the background speech can be detected. As such, the "value" describes the level of the background SPEECH present in the audio signal. Participants may decide how the levels they detect will fit within the provided scale. 0 for no background speech, 9 for loud background speech.

Interval type: background noise
ID: bnoise
Description: Note, this can include both music and any generic "noise" intervals, such as applause, helicopter noise, noise from machinery, noise from animals, etc.
Attributes:
    Name: value
    Format: float
    Values: 0 to 9
    Rules: The "value" is representative of the level of the background NOISE present in the audio signal. 0 for no background noise, 9 for loud background noise.

Interval type: no-speech
ID: nospeech
Description: identifies areas of the audio that do not contain any speech at all.
Attributes: None

Interval type: silence
ID: silence
Description: no significant energy present in the signal; no speech AND no background noise in the signal.
Attributes: None

Interval type: music
ID: music
Description: presence of music
Attributes:
    Name: value
    Format: float
    Values: 0 to 9
    Rules: The "value" characterizes the music level in the audio signal. Participants may decide how the levels they detect will fit within the provided scale. 0 for no music, 9 for loud music.

Interval type: language
ID: language
Description: language the main speaker is using, following ISO-639 (+ ISO-3166) two-letter codes
Attributes:
    Name: type
    Format: string
    Values: "en-GB" for English
            "en-US" for American English
            "foreign" for other foreign languages
    Rules: More information about language types and standards is available at:
           http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt
           http://www.isi.edu/in-notes/iana/assignments/country-codes

Interval type: sentence boundary
ID: sentence
Description: location of the end/start of a sentence
Attributes: None

Interval type: repetition
ID: repeat
Description: identical (exact match) words, speaker, background condition, music, etc. to previous audio.
Attributes: None

Interval type: commercial
ID: commercial
Description: commercial advertisement
Attributes: None
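
The interval definitions above lend themselves to a simple lookup table for checking records. The sketch below encodes the listed IDs, attribute names, and value ranges as this document states them; the names `REGISTRY`, `in_scale`, and `check_record` are hypothetical, and the sketch assumes that attributes are optional, since the definitions do not say whether they are mandatory.

```python
def in_scale(value):
    """True if value parses as a number on the 0-to-9 scale."""
    try:
        return 0.0 <= float(value) <= 9.0
    except ValueError:
        return False

# Interval ID -> {attribute name: validator}, transcribed from the list above.
REGISTRY = {
    "speaker": {"spk_id": lambda v: len(v) > 0},
    "gender": {"gender": lambda v: v in {"M", "F", "?"}},
    "story": {},
    "topic": {},
    "bandwidth": {"type": lambda v: v in {"narrow", "wide"}},
    "svolume": {"value": in_scale},
    "energy": {"level": in_scale},
    "bspeech": {"value": in_scale},
    "bnoise": {"value": in_scale},
    "nospeech": {},
    "silence": {},
    "music": {"value": in_scale},
    "language": {"type": lambda v: v in {"en-GB", "en-US", "foreign"}},
    "sentence": {},
    "repeat": {},
    "commercial": {},
}

def check_record(interval_type, attributes):
    """Return a list of problems; an empty list means nothing was flagged."""
    if interval_type not in REGISTRY:
        return [f"unregistered interval type: {interval_type}"]
    allowed = REGISTRY[interval_type]
    problems = []
    for name, value in attributes.items():
        if name not in allowed:
            problems.append(f"unexpected attribute {name!r} for {interval_type}")
        elif not allowed[name](value):
            problems.append(f"value {value!r} not allowed for {interval_type}.{name}")
    return problems

print(check_record("bandwidth", {"type": "narrow"}))
print(check_record("svolume", {"value": "10"}))
```

A site that registers an additional type with NIST would extend such a table with the new ID and its attribute rules.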