The index file distributed on the CD-ROM has 10 minutes of data for each conversation, which is not consistent with the evaluation plan.
There is a clear and well-defined list of allowable hesitations and other non-lexemes for each language.
Non-lexemes are marked with the '%' symbol. Anything in the transcripts without this marking is judged to be a real lexeme rather than a hesitation sound.
For example, in English the hesitation that could be spelled "%uh" is pronounced exactly the same as the indefinite article "a." Whether a given sound falls within the set of hesitations/non-lexemes or counts as a real word depends on the judgement of the transcribers.
[ DOWNLOAD HESITATIONS ] as a text file.
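The '%' convention described above makes it straightforward to separate non-lexemes from real words when processing a transcript. The following is a minimal sketch, not part of the distributed tools: the function name `split_tokens`, the whitespace tokenization, and the sample line are all assumptions made for illustration.

```python
def split_tokens(line):
    """Separate '%'-marked non-lexemes from ordinary words in one transcript line.

    Assumes tokens are whitespace-separated (an illustrative assumption,
    not a documented property of the transcript format).
    """
    lexemes, non_lexemes = [], []
    for token in line.split():
        if token.startswith("%"):
            # Marked with '%': a hesitation or other non-lexeme
            non_lexemes.append(token)
        else:
            # Unmarked: judged by the transcriber to be a real lexeme
            lexemes.append(token)
    return lexemes, non_lexemes


# Hypothetical sample line: "%uh" is a hesitation, "a" is the article.
lexemes, non_lexemes = split_tokens("i saw %uh a dog")
print(lexemes)
print(non_lexemes)
```

Note that the sketch relies entirely on the marking: it cannot tell "%uh" from "a" by sound, which is exactly the decision left to the transcribers.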
GERMAN: We will make a list available as soon as we have access to the German lexicon. We plan to divide compound words into their components.
SPANISH: NA
MANDARIN: NA