This paper describes preliminary development of a broadcast news transcription system for this year's Hub4 evaluation. The recognition system uses CROWNS (developed at RU for the 1995 Hub3 tasks) with several modifications to handle the news programming task. Features such as model adaptation have been added to quickly provide acoustic models appropriate for the new task, even though the environment-dependent data are limited. The decoding architecture has been changed from single-pass to multi-pass so that higher order language models can be handled more efficiently. Due to the short development period before the evaluation, the preliminary system for this year's Hub4 test produced a higher error rate than expected. In fact, its performance was worse than that of our previous system when compared on the baseline broadcast speech. We have continued the investigation since the test and performed diagnostic experiments. Results and error analysis are given in this report.
We modified our existing system ``CROWNS'' [1] for this year's evaluation. To address the various focus conditions, features such as model adaptation are added to our system. The decoding strategy is also changed from last year's single-pass Viterbi beam search to a multi-pass word graph decoder that can efficiently handle higher order language models.
Due to a lack of development time and experience, our system produced a higher recognition word error rate than expected. We conclude that this is a result of not performing enough experiments during development. A closer examination of the results for the various focus conditions indicates that our system has problems dealing with the large vocabulary used in this year's Hub4 task. After the evaluation, several diagnostic experiments were performed. This paper describes the experimental results and indicates directions for future work.
In the next section, the preliminary system prepared for the 1996 Hub4 test is described, including a new decoding scheme. We then show the use of model adaptation for the Hub4 test. Diagnostic experiments and results are given in section 4. Finally, we present conclusions and directions for future work.
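As a rough illustration of the multi-pass idea, the sketch below rescores a small word graph with a trigram language model. It is not the CROWNS decoder: the graph, the acoustic scores, the trigram table, and the names `EDGES`, `TRIGRAM`, `UNSEEN` and `rescore` are all invented for this example. A first pass is assumed to have produced the graph with acoustic log-scores; the second pass then searches only the graph, keeping one best partial path per (node, two-word history) pair.

```python
# Hypothetical word graph: (from_node, to_node, word, acoustic_log_score).
# Nodes are numbered in topological order; node 0 is the start, node 3 the end.
EDGES = [
    (0, 1, "the", -2.0),
    (0, 1, "a", -2.1),
    (1, 2, "news", -3.0),
    (1, 2, "noose", -2.9),
    (2, 3, "today", -1.5),
]

# Toy trigram log-probabilities (illustrative values only).
TRIGRAM = {
    ("<s>", "<s>", "the"): -0.5,
    ("<s>", "<s>", "a"): -1.0,
    ("<s>", "the", "news"): -0.7,
    ("<s>", "the", "noose"): -6.0,
    ("<s>", "a", "news"): -3.0,
    ("<s>", "a", "noose"): -5.0,
    ("the", "news", "today"): -0.4,
}
UNSEEN = -8.0  # flat penalty for trigrams missing from the toy table


def rescore(edges, start, end, lm_weight=1.0):
    """Return (score, words) of the best path under acoustic + trigram scores."""
    # best[(node, (w1, w2))] = (score, word list) for the best partial path
    # that reaches `node` with `w1 w2` as its last two words.
    best = {(start, ("<s>", "<s>")): (0.0, [])}
    # Sorting by source node visits edges in topological order, so every
    # state at a node is final before edges leaving that node are expanded.
    for src, dst, word, acoustic in sorted(edges):
        for (node, hist), (score, words) in list(best.items()):
            if node != src:
                continue
            lm = TRIGRAM.get((hist[0], hist[1], word), UNSEEN)
            new_score = score + acoustic + lm_weight * lm
            key = (dst, (hist[1], word))
            if key not in best or new_score > best[key][0]:
                best[key] = (new_score, words + [word])
    finals = [v for (node, _), v in best.items() if node == end]
    if not finals:
        raise ValueError("no path reaches the end node")
    return max(finals, key=lambda v: v[0])


if __name__ == "__main__":
    score, words = rescore(EDGES, start=0, end=3)
    print(" ".join(words), score)
```

In this toy graph the acoustically best path is "the noose today", while the trigram rescoring selects "the news today". Because the second pass only expands histories that actually occur in the graph, the cost of a higher order model stays bounded by the graph size rather than by the full vocabulary.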
CROWNS uses continuous density triphone HMMs as basic acoustic models. The HMMs are trained by the conventional forward/backward (EM) algorithm. Each state of a triphone model uses a mixture of Gaussians as its output distribution. Transition probabilities are fixed. To obtain more robust and reliable estimates for the huge number of Gaussian mixtures used in the system, state tying is used, with the furthest neighborhood tying suggested by [2].
There are 7 focus conditions for the primary tests in the Hub4 evaluation:
In recognition of the similarity of some focus conditions, four sets of basic acoustical models are constructed. The WSJ SI284 corpus is used to bootstrap each set of the models. The conditions are described as follows:
Each set of models consists of 6930 word-internal triphones with 8782 state clusters. Each state cluster contains 15 Gaussian mixture components, giving approximately 130K Gaussians in total (8782 × 15 = 131,730). No gender-dependent models are used.
With the currently available text materials, 3 sets of bigram/trigram language models (LMs) are trained. The LMs are generated using CMU SLM V1.0 with some fixes for processing punctuation marks and removing extraneous words. Table 1 shows the trigram perplexity (PP) and OOV rate for the 3 different LMs.
| LM training corpus | Trigram perplexity | OOV rate (%) |
| 1995 Hub3 (LM1) | 319.09 | 1.02 |
| 1996 Hub4 (LM2) | 269.55 | 0.86 |
| LM1 + LM2 (LM3) | 258.77 | 0.87 |
Using the text from the 96 Hub4 only (LM2) gives about a 15% reduction in trigram perplexity relative to LM1. Since the combined corpus produces the lowest perplexity and almost the same OOV rate as the 96 Hub4 text alone, our final system uses the LM trained on the combined corpus (LM3).
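For reference, the standard definition of trigram perplexity over a test text of N words, and the arithmetic behind the quoted reductions, can be written as follows. The notation is introduced here for illustration; the numbers are taken from Table 1.

```latex
PP = P(w_1, \dots, w_N)^{-1/N}
   = \exp\!\Bigl( -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{i-2}, w_{i-1}) \Bigr)

\frac{PP_{\mathrm{LM1}} - PP_{\mathrm{LM2}}}{PP_{\mathrm{LM1}}}
   = \frac{319.09 - 269.55}{319.09} \approx 0.155
\qquad
\frac{PP_{\mathrm{LM2}} - PP_{\mathrm{LM3}}}{PP_{\mathrm{LM2}}}
   = \frac{269.55 - 258.77}{269.55} \approx 0.040
```

Adding the 1995 Hub3 text to the Hub4 text therefore lowers perplexity by a further 4% or so relative to LM2.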
The recognition lexicon consists of the 60K most frequent words from LM3. Pronunciations for the 60K lexicon are extracted mainly from the CMU dictionary, with approximately 600 words added by hand.
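The lexicon selection and the OOV rate reported in Table 1 can be sketched as follows. This is a minimal illustration, not the CMU SLM toolkit: the function names `build_lexicon` and `oov_rate` and the toy word lists are assumptions made for the example, and text normalization is ignored.

```python
from collections import Counter


def build_lexicon(training_words, size):
    """Keep the `size` most frequent word types of the training text."""
    counts = Counter(training_words)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(lexicon, test_words):
    """Percentage of running test words that are not in the lexicon."""
    if not test_words:
        raise ValueError("test text is empty")
    missing = sum(1 for word in test_words if word not in lexicon)
    return 100.0 * missing / len(test_words)


if __name__ == "__main__":
    train = "the news the report the news today".split()
    test = "the news report today".split()
    lexicon = build_lexicon(train, size=2)  # the paper uses size = 60K
    print(sorted(lexicon), oov_rate(lexicon, test))
```

Every OOV word is guaranteed to be misrecognized, so the OOV rate is a lower bound on the errors attributable to the lexicon.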
The results of the RU 96 Hub4 test, as word error rates in percent, are tabulated in Table 2:
| F0 | F1 | F2 | F3 | F4 | F5 | Fx | Average |
| 42.7 | 51.9 | 72.9 | 50.0 | 59.2 | 54.8 | 71.0 | 53.8 |
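The word error rate is the number of substitutions, deletions and insertions in the best alignment of the hypothesis against the reference, divided by the number of reference words. The sketch below computes it with a plain edit distance. It is an illustration only and not the official scoring tool; the function name `word_error_rate` and the example sentences are invented here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the news was broadcast today",
                          "the noose was broadcast"))
```

Since insertions are counted, the rate can exceed 100% for very poor hypotheses.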
Compared to the results from other participating sites, our system produced a word error rate that is 30% to 40% higher in relative terms. This is certainly not what we expected. After the evaluation, we continued to analyze errors and conducted several diagnostic experiments. The results in Table 2 clearly indicate that our system does not perform well for F0, the baseline broadcast speech. This condition has a dominant overall impact on all other conditions. The most obvious mistake we made was not performing any guiding experiments during the development period, even though the period was brief. We completed the model training and system integration only one week before the evaluation due to hardware problems. Model adaptation was used without any dry run on the Hub4 dev data. These factors contributed to the poor performance. We have performed several basic diagnostic experiments to answer the following questions:
One way to examine the question is to use this year's acoustic models on previous ARPA evaluation tasks. To simplify the recognition procedure, we ran the 92 WSJ 5k and 20K tests using a standard bigram LM. For comparison, a previously trained in-house model was used for the same tests. Compared to the in-house acoustic models, this year's system yields a degradation of more than 5% absolute. The only differences between the two systems are the tying parameters: this year's model is less tied and has more Gaussian mixture components. We originally expected that more relaxed tying would improve performance; on the contrary, it became the main degrading factor. Table 3 compares the recognition word error rates of the two systems.
| System | States/Gaussians | WSJ 5k | WSJ 20K |
| 96 Hub4 | 8782/130K | 16.0 | 20.2 |
| In-house | 2333/80K | 8.9 | 14.4 |
From the table, a relative error reduction of roughly 29% (WSJ 20K) to 44% (WSJ 5k) can be achieved by using different tying criteria. We therefore conclude that the acoustic models used for this year's test are not good. We proceeded to use the better acoustic models to decode the F0 portion of the data, and found that the error rate was reduced from 42.7% to 32%. The answer to the question is therefore NO: the acoustic models hurt performance in this year's evaluation.
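The relative reductions implied by Table 3 and by the F0 result can be checked with a few lines of arithmetic. The helper name `relative_reduction` is introduced here for illustration; the error rates are the ones quoted in the text.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# Word error rates (%) as (96 Hub4 models, in-house models), from Table 3,
# plus the F0 result before and after swapping in the in-house models.
RESULTS = {
    "WSJ 5k": (16.0, 8.9),
    "WSJ 20K": (20.2, 14.4),
    "Hub4 F0": (42.7, 32.0),
}

if __name__ == "__main__":
    for task, (hub4, in_house) in RESULTS.items():
        print(f"{task}: {hub4 - in_house:.1f} absolute, "
              f"{relative_reduction(hub4, in_house):.1f}% relative")
```

On these numbers the gain from the more tightly tied models is largest on the 5k task and smallest on F0.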
We performed recognition on the F0 portion using LM2 and LM3. No difference in recognition word error rate was found, which is consistent with the two models being very similar in perplexity and OOV rate. What remains is to perform the same test using the LM from 1995 Hub3 (LM1). It would be interesting to see the result, and this experiment is under way.
Since we had no dry run of the adaptation procedure, we suspected that a bug in the adaptation had affected this year's acoustic models. To verify the adaptation procedure, we performed an environmental adaptation in our hands-free distant speech recognition experiments. The goal is to adapt models trained on clean speech to a noisier and more reverberant environment. The test speech is recorded 12 ft away from the talker using a microphone array. We find that for the 1000-word recognition task (Distant-RM), the adaptation normally reduces the word error rate from 57% to 15%. Our best result for the same task is 9%, using neural-network-based feature domain compensation. This result indicates that the adaptation procedure itself has no problem.
In the second experiment, we ran recognition using both the adapted and the seed models (Set 1) on the F0 portion of the data. No difference in recognition performance was found; that is, no gain was observed from running the adaptation procedure. The answer to the question is NO, but not conclusively. It is still possible that we do not use enough transforms (currently, only 1) for the given amount of adaptation data. A larger number of transforms should be chosen based on further experimental results.
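The text does not spell out the form of the adaptation transform. Assuming it follows the maximum likelihood linear regression (MLLR) scheme of Leggetter and Woodland listed in the references, a single global transform would update the mean of every Gaussian m as sketched below. The symbols are introduced here for illustration.

```latex
\hat{\mu}_m = A\,\mu_m + b = W\,\xi_m,
\qquad
\xi_m = \begin{bmatrix} 1 \\ \mu_m \end{bmatrix},
\qquad
W = \begin{bmatrix} b & A \end{bmatrix}

% With R regression classes, Gaussian m in class r(m) uses its own transform:
\hat{\mu}_m = W_{r(m)}\,\xi_m, \qquad r(m) \in \{1, \dots, R\}
```

Under this assumption, R = 1 means that all Gaussians in a model set share one matrix W, which can only capture a global shift and rotation of the means. Increasing R allows different groups of Gaussians to move differently, at the cost of needing enough adaptation data to estimate each W reliably.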
2. C.J. Leggetter and P.C. Woodland, "Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression," Proc. Spoken Language System Technology Workshop, pp. 110-115, Jan 1995
3. D. Paul and B. Necioglu, "The Lincoln Large-Vocabulary Stack-Decoder HMM CSR," Proc. ICASSP, Vol. II, pp. 660-663, Minneapolis, 1993
4. P.C. Woodland and S.J. Young, "The HTK Tied State Continuous Speech Recognition," Proc. EuroSpeech, Vol. 3, pp. 2207-2210, Berlin, 1993
5. K.-F. Lee, "Automatic Speech Recognition: The Development of the SPHINX System," Kluwer Academic Publishers, Boston, 1989
6. H. Murveit, J. Butzberger, V. Digalakis and M. Weintraub, "Large-Vocabulary Dictation Using SRI's DECIPHER (TM) Speech Recognition System: Progressive-Search Techniques," Proc. ICASSP, Vol. II, pp. 319-322, Minneapolis, 1993
7. V. Steinbiss, B.-H. Tran and H. Ney, "Improvements in Beam Search," Proc. Int. Conf. on Spoken Language Processing, Philadelphia, PA, Oct 1996
8. G. Antoniol, F. Brugnara, M. Cettolo and M. Federico, "Language Model Representation for Beam Search Decoding," Proc. ICASSP, Vol. I, pp. 588-591, Detroit, MI, May 1995
9. M. Oerder and H. Ney, "Word Graphs: An Efficient Interface Between Continuous Speech Recognition and Language Understanding," Proc. ICASSP, Vol. II, pp. 119-122, Minneapolis, MN, Apr 1993