Petra Geutner, Michael Finke, Peter Scheytt, Alex Waibel and Howard Wactlar
This paper describes first results of our DARPA-sponsored efforts toward recognizing and browsing foreign-language broadcast news, more specifically Serbo-Croatian broadcast news. For Serbo-Croatian, as for many languages other than the most common and well-studied ones, the problems of broadcast quality recognition are complicated by 1.) the lack of available acoustic and language data, and 2.) the excessive vocabulary growth in heavily inflected languages, which leads to unacceptable OOV-rates. We present a Serbo-Croatian large vocabulary system that achieves a 74% recognition rate despite limited training data. Our system achieves this rate by a multipass strategy that dynamically adapts the recognition dictionary to the speech segment to be recognized by generating morphological variations (Hypothesis Driven Lexical Adaptation).
We will outline the bootstrapping and training process of the Janus Recognition Toolkit (JanusRTk) based broadcast news recognition engine: data collection, segmentation and labeling of the data according to different acoustic conditions, dictionary design, language modeling and training. The Hypothesis Driven Lexical Adaptation (HDLA) approach has been tested both on Serbo-Croatian and German news data and has achieved considerable recognition improvements. OOV-rates were reduced by 35-45%; on the Serbo-Croatian broadcast news data from 8.7% to 4.8% thereby also decreasing word error rate from 29.5% to 26%.
When transcribing broadcast news data in languages other than the most common and well-studied ones, the problems of broadcast quality recognition are complicated by 1.) the lack of available acoustic and language data (since closed captions are typically not available), and 2.) the excessive vocabulary growth in heavily inflected languages, which leads to unacceptable OOV-rates. While full-form word entries lead to excessively large vocabularies (and OOV-rates), the use of morpheme-based dictionaries also offers little relief: combining arbitrary morphemic affixes by way of a morphemic language model overgenerates illegally inflected word hypotheses and thus increases error rates. This is especially the case for languages like Serbo-Croatian and German. As Serbo-Croatian is characterized by rapid vocabulary growth due to a large number of possible word inflections, we have to deal with out-of-vocabulary rates between 5% and 13%. This makes OOV-words a major source of recognition errors in multilingual broadcast news.
We present a Serbo-Croatian large vocabulary system that achieves a recognition rate of about 74%, despite very limited acoustic and language modeling training data. Our Serbo-Croatian JanusRTk based recognizer was trained on 12.5 hours of recorded speech of read newspaper articles and 27 broadcast news shows. In the following we will outline the bootstrapping and training process: data collection, segmentation and labeling of the data according to different acoustic conditions, dictionary design, language modeling and training.
Focusing on the reduction of the high OOV-rate, this work presents a two-pass recognition approach where the first pass is used to dynamically adapt the recognition dictionary to the speech segment to be recognized. The basic idea is that a large number of words in the hypothesis are recognized incorrectly because only the inflection ending is wrong, but the word-stem is recognized correctly. Often the right word was not in the dictionary thus constituting an OOV-word. By applying our multipass strategy we generate morphological variations of dictionary words only in a focused fashion (Hypothesis Driven Lexical Adaptation), thus dynamically adapting the recognition vocabulary. A second recognition run is then carried out on the adapted vocabulary.
Our approach has been tested both on Serbo-Croatian and German news data and has achieved considerable recognition improvements. In the former the OOV-rate could be decreased by 45% from 8.7% to 4.8% and in the latter case the OOV-rate dropped by 35% from 9.3% to 6.0%. Word accuracy experiments have been performed on Serbo-Croatian Broadcast News data, where we observed a relative performance improvement of 12% from 29.5% to 26% word error with an adapted vocabulary.
Two Serbo-Croatian speech databases have been collected at the Interactive Systems Laboratory at the University of Karlsruhe: an 18 hour database consisting of read newspaper speech and a total of 18 hours of recorded and transcribed broadcast news shows.
The audio data for the first database, the dictation material, was collected in Croatia and Bosnia-Hercegovina. Native speakers were asked to read 20 minutes of news texts extracted from the HRT (Croatian Radio and Television) web site and Obzdor Nacional, a Croatian newspaper. The speech was digitally recorded using a portable DAT-recorder at a sampling rate of 48 kHz in stereo quality and then downsampled to 16 kHz with 16 bit resolution in mono quality. The read utterances were checked against the original text to eliminate major errors and mark spontaneous effects. This data was originally collected as part of the GlobalPhone project at the University of Karlsruhe.
| # Speakers | # Articles | Recording Length | # Words |
|---|---|---|---|
| 85 | 131 | 18 h | 89,000 |
The broadcast news data was also collected at the University of Karlsruhe in Germany. A satellite dish and a dedicated PC, equipped with an MPEG encoder board, were installed to record the HRT evening news show, which is transmitted from Croatia via the Eutelsat satellite. The television signal was digitally recorded in MPEG format (target bit rate: 1.008 Mbit/s, audio bit rate: 0.192 Mbit/s, sampling rate: 44.1 kHz). For speech recognition the audio signal was decompressed and downsampled to 16 kHz with 16 bit resolution. As no closed captions were available, the news broadcasts were transcribed by native speakers. Similar to the HUB4 corpus for English broadcast news data, the Serbo-Croatian recordings were divided into segments. Within these segments the acoustic conditions remained constant, and each segment was tagged regarding channel quality, background noises and speaker. The various tags used in these three categories are shown in table 2, where "Non-Serbo-Croatian" identifies a person speaking in a language other than Serbo-Croatian, most often English.
| Speaker | Channel | Noise |
|---|---|---|
| Male | Clean | Music |
| Female | Telephone | Second Speaker |
| Non-Serbo-Croatian | Distorted | Conference |
| Unknown | Unknown | Street |
| None | | Static Noise |
| | | Other |
| | | None |
In addition to these acoustic tags only the most frequent and clearly audible spontaneous effects were transcribed: Hesitation, breathing and some other human and non-human noises.
It took about 13 to 18 hours to transcribe a news broadcast of approximately 40 minutes because of
| Source | Broadcasts | Recording Length | # Words |
|---|---|---|---|
| HRT (MPEG) | 27 | 18 h | 118k |
| RFE/RL (RA) | 7 | 0.5 h | 7k |
| Total | 33 | 18.5 h | 125k |
In addition to the transcripts of the 27 news shows, other sources of text data had to be collected to build up a sufficiently large corpus for language modeling purposes. Searching the internet, we retrieved text data from 20 different sources (television and radio stations, newspapers and news agencies). During text processing we encountered one major problem: many sites simply map diacritics onto their corresponding non-diacritical letter, e.g. c, č and ć all become c. In order to build language models based on the web portion of the text corpus we had to automatically invert this mapping.
A statistical approach was used to convert web texts into usable language model training texts with diacritics. To make the conversion reliable, as many Serbo-Croatian texts with diacritics as were available were collected, and from these texts a list of correct words was generated. This list served as the reference for converting a second list, which was extracted from the texts without special characters and contained both correct and false word forms. Our conversion algorithm works as follows:
In the test text 25% of the words were converted incorrectly using this mechanism. This allows a better conversion than just leaving the words as they are, which produces an error rate of 70%. Thus finally the combined conversion error rate of the whole algorithm on the test text was 5% and enabled us to use more than twice the amount of text training material than we had before.
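As an illustration of the list-based lookup at the core of such a conversion, the following Python sketch maps each word without diacritics to the most frequent diacritic-bearing form seen in reference text. The function names, the toy word list, and the choice of the most frequent form are assumptions made for illustration and are not the authors' procedure; words missing from the reference list are simply left unchanged.

```python
from collections import Counter, defaultdict

# Map each diacritic letter to its plain counterpart, mimicking the web sites.
STRIP = str.maketrans("čćšžđČĆŠŽĐ", "ccszdCCSZD")

def strip_diacritics(word):
    return word.translate(STRIP)

def build_lookup(reference_words):
    """For each stripped form, remember the most frequent reference form."""
    counts = defaultdict(Counter)
    for w in reference_words:
        counts[strip_diacritics(w)][w] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}

def restore(text_words, lookup):
    # Words unknown to the reference list are left unchanged.
    return [lookup.get(w, w) for w in text_words]

# Toy reference list (hypothetical); real use would read the diacritic corpora.
reference = ["kuća", "kuća", "kuca", "već", "reč"]
lookup = build_lookup(reference)
restored = restore(["kuca", "vec", "danas"], lookup)
```

A lookup of this kind cannot resolve forms that are legitimately ambiguous without context, which is one reason a residual conversion error remains.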
| Character Set | Web Sites | # Words |
|---|---|---|
| Diacritics | 7 | 5 M |
| No Diacritics | 13 | 6 M |
| Total | 20 | 11 M |
For building a Serbo-Croatian broadcast news recognizer the Janus Recognition Toolkit (JRTk) [1] was used. Phone set and a pronunciation dictionary were generated almost automatically, as Serbo-Croatian orthography closely matches its pronunciation. As a consequence the phone set corresponds almost exactly to the alphabet and consists of 30 phones, 4 noise and 1 silence models. The pronunciation dictionary was created by an automatic grapheme-to-phoneme tool. Some manual adjustments were necessary for numbers, abbreviations, foreign words and names.
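Because the orthography is close to phonemic, a first-pass grapheme-to-phoneme conversion can be very simple. The Python sketch below is only an illustration under stated assumptions: it uses the letters themselves as phone symbols and treats the three digraphs of the Latin alphabet (dž, lj, nj) as single phones. It is not the tool used for the system, and it does not cover the manual adjustments for numbers, abbreviations, foreign words and names, nor words in which such a letter pair spans a morpheme boundary.

```python
# Digraphs that stand for a single phone must be matched before single letters.
DIGRAPHS = ("dž", "lj", "nj")

def graphemes_to_phones(word):
    """Split a word into phone symbols, one per letter or digraph."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phones.append(word[i:i + 2])
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones

# Example: build a tiny pronunciation dictionary (hypothetical word list).
dictionary = {w: graphemes_to_phones(w) for w in ["ljubav", "rijeka", "konj"]}
```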
Each phone is modeled by a left-to-right HMM with 16 diagonal Gaussians. The preprocessing of the system consists of extracting Mel-frequency cepstral coefficients every 10 ms. The final feature vector is computed by a truncated LDA transformation of a concatenation of MFCCs and their first and second order derivatives. Vocal tract length normalization and cepstral mean subtraction are used to compensate for speaker and channel differences.
A first context-independent Serbo-Croatian dictation system was trained using the labels generated by a speaker-adapted German recognizer (label boosting [2]). The Serbo-Croatian phones were initialized by their closest German equivalents. A backoff trigram language model built on the very few available training transcriptions was used. The labels rewritten by the first Serbo-Croatian trained recognizer turned out to be more accurate and were used to train a context-dependent dictation system.
With a vocabulary size of 18k words, speaker-dependent VTLN, MLLR adaptation during testing and the use of interpolated language models from different corpora, initial system performance improved to 28.2% word error rate on the read newspaper test set using our dictation engine D1.
| Test Data | Vocabulary Size | OOV-Rate | Word Error |
|---|---|---|---|
| read data | 18k | 8.5% | 28.2% |
| broadcast news | 18k | 22.2% | 73.6% |
For the broadcast news domain the test set consists of acoustic segments from two news broadcasts. Results reported below correspond to the English PE (partitioned evaluation) test set in the last HUB4 evaluation (December 1996) in which the segments and their constant acoustic properties were given for training and testing.
A first test run on the baseline system with 28.2% word error rate on the dictation data resulted in 73.6% word error rate on the broadcast news test set (see tables 5 and 6). This was mainly due to the noisy conditions even in the clean segments. The baseline dictation system was used to label our broadcast news data and train a first recognizer (B0) on only 10 hours of transcribed recordings. This context-dependent system was set up with 2k codebook vectors over 24 input features. The vocabulary size was 29k, the OOV-rate 14.0%.
| System | Vocabulary Size | OOV-Rate | Word Error |
|---|---|---|---|
| D1 | 18k | 22.2% | 73.6% |
| B0 | 29k | 14.0% | 43.6% |
| B4 | 31k | 13.6% | 36.0% |
| B5 | 49k | 8.7% | 29.5% |
As interpolation of different language models resulted in performance improvements with system D1 on the dictation task, we expected the same for our broadcast news recognizer. Compared to using a single language model an absolute improvement of 1.6% word error rate was made. As we had collected text corpora from about 20 different sites, we applied three criteria to divide our text data into different sets: Geographical origin (Serbia vs. Croatia), content source (television and radio stations vs. newspaper and news agencies) and language model perplexity. An interpolation of three different text corpora yielded the best recognition results.
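The text does not state how the language models were interpolated. Assuming plain linear interpolation, which is a common choice, the combined probability of a word is a weighted sum of the probabilities given by the individual corpus models. The Python sketch below shows only this combination step; the probabilities and weights are made-up values, and in practice the weights would be tuned on held-out data.

```python
def interpolate(prob_by_model, weights):
    """Linear interpolation: P(w|h) = sum_i lambda_i * P_i(w|h)."""
    if len(prob_by_model) != len(weights):
        raise ValueError("need exactly one weight per language model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(lam * p for lam, p in zip(weights, prob_by_model))

# Hypothetical probabilities of one word given its history under three corpora.
p = interpolate([0.02, 0.005, 0.01], [0.5, 0.3, 0.2])
```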
Further improvements were made by a weighted combination of the dictation and broadcast news training data (B4). We augmented the vocabulary to 31k words, which slightly reduced the OOV-rate to 13.6%. The recognizer trained on this data, with about twice as many parameters (mixtures of Gaussians) as B0, reached 36% word error rate (see table 6).
Our final Serbo-Croatian broadcast news recognizer (B5) was trained on 12.5 hours of dictation data and 18 hours of transcribed news shows. The context-dependent system is based on 4000 quinphone models. The preprocessing of the system consists of extracting an MFCC based feature vector every 10 ms with a window size of 20 ms. The final 32-dimensional feature vector is computed by a truncated LDA transformation of a concatenation of 13 MFCCs, the energy value, their first and second order derivatives, plus zero crossing. Compared to the previous B4 system, two major changes were made: 1.) the text material used for language model training was normalized, 2.) the vocabulary size of the recognition dictionary was increased from 31k to 49k.
In Serbo-Croatian up to three different dialectal variants of one word can be found, e.g. for the English word 'river' there is the Serbian variant 'reka', but also the Croatian variants 'rjeka' and 'rijeka' (see table 7). When normalizing all available text material, the latter two variants were replaced by the first one in all texts (language model corpora, training and test data) and added as pronunciation variants to the dictionary. Increasing the vocabulary size from 31k to 49k led to an OOV-rate of 10.1% instead of 13.6% on the unnormalized data, and normalization further reduced the OOV-rate to 8.7%.
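The normalization step can be pictured as a simple table lookup. The following Python sketch is a minimal illustration using only the 'river' example from the text; the variant table and function names are hypothetical, and how the full variant table was actually obtained is not described here.

```python
# Hypothetical variant table: dialect variant -> normalized (first) form.
VARIANTS = {"rjeka": "reka", "rijeka": "reka"}

def normalize(tokens, variants=VARIANTS):
    """Replace each known dialect variant by its normalized form."""
    return [variants.get(t, t) for t in tokens]

def pronunciation_entries(variants=VARIANTS):
    """Group variants under their normalized headword as pronunciation variants."""
    entries = {}
    for variant, head in variants.items():
        entries.setdefault(head, [head]).append(variant)
    return entries

normalized = normalize(["velika", "rijeka"])
entries = pronunciation_entries()
```

Because all variants collapse onto one dictionary entry, the vocabulary needs only one slot per word while the acoustic models can still match every spoken variant.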
Let N be the maximum number of words a speech recognition engine can handle in decoding. For speed and memory reasons this number is limited in current state-of-the-art recognizers to somewhere in the range of 20k to 60k words. Constraining the maximum number of words can be considered acceptable when building recognizers for languages like English, where the number of out-of-vocabulary words given an N=60k vocabulary is below one percent. With error rates for tasks like broadcast news or conversational speech (Switchboard) between 30% and 40% (due to highly disfluent speech, noisy environments, overlapping speech, music etc.), an OOV-rate of less than one percent is not considered a significant source of errors (rule of thumb: one OOV-word causes additional recognition errors beyond the word itself).
As shown above for Serbo-Croatian, for languages other than English the picture is very different. When aiming at reasonable automatic transcription performance in the broadcast news domain for languages that are characterized by rapid vocabulary growth due to a large number of possible word inflections (for Serbo-Croatian see table 9 and for German table 10), we have to expect out-of-vocabulary rates between 5% and 13%. Figure 1 shows the number of words as a function of the number of tokens in broadcast news data for both German and Serbo-Croatian. In figure 2 we compare the self-coverage and cross-coverage as measured on newspaper and broadcast news text corpora.
A cheating experiment can be performed by pretending that all information about the news vocabulary of a certain day is accessible. Even if all important keywords of the day of transmission of the news broadcast were known, the OOV-rate would only decrease to 7.8% in Serbo-Croatian news. The same cheating experiment on German data resulted in an OOV-rate of 5.5%. This means that a significant portion of the OOV-words is not necessarily day or event related (that is, caused by new events introducing new words).
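For reference, the OOV-rate used throughout is the fraction of running words in the test data that are missing from the recognition vocabulary. The Python sketch below computes it on a toy example and mimics the cheating experiment by adding one "keyword of the day" to the vocabulary; the word lists are invented for illustration only.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running words (tokens) that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)

# Toy data (hypothetical): two inflected forms of one verb are out of vocabulary.
vocab = {"vlada", "je", "danas", "u", "zagrebu"}
test_tokens = ["vlada", "je", "danas", "odlučila", "u", "zagrebu", "odlučiti", "danas"]

base = oov_rate(test_tokens, vocab)
# "Cheating": pretend one keyword of the day is known in advance.
cheat = oov_rate(test_tokens, vocab | {"odlučila"})
```

Even with the keyword added, the second inflected form stays out of vocabulary, which mirrors the observation that many OOV-words are inflection related rather than event related.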
Therefore, the following vocabulary adaptation approach makes use of acoustic similarity instead of semantic similarity to reduce the OOV-rate. A first recognition run on a general baseline dictionary is followed by a second recognition run with a dynamically adapted dictionary of the same size but a smaller OOV-rate. Especially in a time uncritical process like the recognition of broadcast news this seems to be a practical idea.
In a first recognition run, word lattices for all test utterances are created. The lattice is then used to determine, which words are most likely uttered in the segment (namely all words represented in the lattice). For each utterance to be recognized this lattice leads to an utterance-specific vocabulary. This vocabulary is then used to dynamically adapt the recognition dictionary. The basic idea is, that a large number of words in the recognized hypothesis are recognized incorrectly because only the inflection ending is wrong whereas the stem was recognized correctly. In many cases this was not due to misrecognition but because the right word was not even in the dictionary of the recognizer, so constituting an OOV-word. The algorithm below shows the whole Hypothesis Driven Lexical Adaptation process:
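A minimal sketch of the vocabulary adaptation step, under stated assumptions, is given here in Python. It assumes that a word's stem is simply its first few letters, that candidate word forms come from a large word list collected from text, and that the adapted vocabulary is filled up with the most frequent baseline words so that its size stays constant. All names are hypothetical, and the selection details of the actual system may differ.

```python
def stem(word, stem_len=5):
    # Crude stemming assumption: the first stem_len letters act as the stem.
    return word[:stem_len]

def adapt_vocabulary(lattice_words, baseline_vocab, full_word_list,
                     max_size, stem_len=5):
    """Build an utterance-specific vocabulary for the second recognition pass.

    lattice_words  : words found in the first-pass lattice
    baseline_vocab : baseline words, most frequent first
    full_word_list : large list of known word forms (e.g. from text corpora)
    """
    stems = {stem(w, stem_len) for w in lattice_words if len(w) >= stem_len}
    adapted, seen = [], set()

    def add(word):
        if word not in seen and len(adapted) < max_size:
            adapted.append(word)
            seen.add(word)

    # 1. Keep every word hypothesized in the first pass.
    for w in lattice_words:
        add(w)
    # 2. Add all known forms that share a stem with a lattice word.
    for w in full_word_list:
        if stem(w, stem_len) in stems:
            add(w)
    # 3. Fill the remaining slots from the baseline vocabulary.
    for w in baseline_vocab:
        add(w)
    return adapted

# Toy example (hypothetical data).
result = adapt_vocabulary(
    lattice_words=["odlučila", "danas"],
    baseline_vocab=["je", "u", "danas", "odlučila", "vlada"],
    full_word_list=["odlučiti", "odlučio", "vlade"],
    max_size=5,
)
```

The second recognition run would then use the adapted list as its dictionary; the language model treatment of the newly added forms is outside the scope of this sketch.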
This vocabulary adaptation procedure applied to Serbo-Croatian broadcast news data yields a significant improvement in terms of the OOV-rate, which is reduced by about 40% relative (see table 11), and in terms of accuracy, with the word error rate reduced by 5.8% absolute (see table 12).
| System | Vocabulary Size | OOV-Rate | Word Error |
|---|---|---|---|
| Baseline | 31k | 13.6% | 36.0% |
| Adapted | 31k | 7.9% | 30.2% |
| Suffix Length | Wordstem Length 2 | Wordstem Length 3 | Wordstem Length 4 | Wordstem Length 5 | Wordstem Length 6 |
|---|---|---|---|---|---|
| fixed | - | - | 7.7% | 6.0% | 6.5% |
Table 13 shows that the same result holds for German news data: again the OOV-rate is reduced significantly. For German a fixed list of suffixes was used to create the word stems. Some examples of the suffixes used are given in table 10. Using this linguistic knowledge for decomposition also resulted in a large OOV-rate reduction from 9.3% to 6.0% (see table 13).
In both languages it turned out to be a good choice to fix the stem length to 5 which is correlated with the distribution of word lengths (50% of the words are longer than 5 letters). Figure 4 shows the distribution of different word lengths in Serbo-Croatian and German.
| System | Vocabulary Size | OOV-Rate | Word Error |
|---|---|---|---|
| Baseline | 49k | 8.7% | 29.5% |
| Adapted | 49k | 4.8% | 26.0% |
The same experiments as described above for system B4 were also performed on our latest B5 system. Starting from a baseline performance of 29.5% word error and an OOV-rate of 8.7%, HDLA reduced the OOV-rate to 4.8%. The 3.9% absolute improvement in OOV-rate was also reflected in a 3.5% absolute improvement in word error rate, yielding a performance of 26% word error.
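The relative figures quoted for the B5 system follow directly from these absolute numbers. The short Python check below recomputes them; the helper name is chosen here for illustration.

```python
def relative_reduction(before, after):
    """Relative reduction in percent, e.g. of an OOV-rate or word error rate."""
    return 100.0 * (before - after) / before

oov_rel = relative_reduction(8.7, 4.8)    # OOV-rate, B5 baseline vs. adapted
wer_rel = relative_reduction(29.5, 26.0)  # word error rate, same systems
print(f"OOV-rate: {oov_rel:.1f}% relative, word error: {wer_rel:.1f}% relative")
```

Rounded, this gives the roughly 45% relative OOV-rate reduction and 12% relative word error reduction reported for Serbo-Croatian.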
The automatically generated transcripts are inserted into the multilingual Informedia database (see figure 5). In collaboration with the Informedia group at CMU [3] we have now introduced the Serbo-Croatian recognizer into a multilingual information retrieval system. Together with the Serbo-Croatian broadcast video material, the transcripts and the recognizer allow for automatic content-addressable search and multimedia document retrieval across languages (see separate system demo).
The extension of the Informedia database to more than one language will not only add to the diversity of information retrieved by mono-lingual queries but will also offer the possibility to phrase queries in several languages. Thus, the development of our Serbo-Croatian recognizer [4] provides an instance of a potentially larger multilingual information resource.
In this paper we described the development of a Serbo-Croatian broadcast news recognizer. It was shown that despite the very limited amount of training data a performance of 26% word error rate can be achieved. With respect to the problem of encountering excessive growth of vocabularies in heavily inflected languages like Serbo-Croatian and German, Hypothesis Driven Lexical Adaptation turned out to be a very effective means of reducing the rate of out-of-vocabulary words. By applying this two-pass recognition technique morphological variations were generated in a focused fashion which effectively reduced the number of OOV-words by 45% relative from 8.7% to 4.8% OOV-rate.
This research was partly funded by the Advanced Research Projects Agency under contract No. N66001-97-D-8502. The views and conclusions contained in this document are those of the authors and do not necessarily reflect the position or policy of the Government and no official endorsement should be inferred. Special thanks to Alex Hauptmann and all members of the Informedia group at Carnegie Mellon for their help and collaboration.