TOWARD AUTOMATIC RECOGNITION OF JAPANESE BROADCAST NEWS

Tatsuo Matsuoka*, Yuichi Taguchi**, Katsutoshi Ohtsuki*, Sadaoki Furui*, and Katsuhiko Shirai**

* NTT Human Interface Laboratories
3-9-11 Midori-cho, Musashino-shi, Tokyo 180, Japan
** Department of Information and Computer Science, Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169, Japan
e-mail: matsuoka@splab.hil.ntt.co.jp



ABSTRACT

In this paper we report on automatic recognition of Japanese broadcast-news speech. We have been working on large-vocabulary continuous speech recognition (LVCSR) for Japanese newspaper speech transcription and have achieved reasonably good performance. We have recently applied our LVCSR system to transcribing Japanese broadcast-news speech. We extended the vocabulary to 20k words and trained two language models, one using newspaper texts and one using broadcast-news manuscripts. Both language models were applied to our evaluation speech sets. The language model trained using broadcast-news manuscripts achieved better results for broadcast-news speech than the language model trained using newspaper texts, which in turn achieved better results for newspaper speech. In preliminary experiments on Japanese broadcast-news transcription, we achieved a word accuracy of 79.3% for anchor-speakers' speech by using a language model trained using broadcast-news manuscripts and newspaper texts.

1. INTRODUCTION

The DARPA Hub-4 test that began in 1995 is evaluating the use of large-vocabulary continuous speech recognition (LVCSR) to transcribe audio recordings of broadcast news. Several preliminary Hub-4 evaluation results have been reported [1-5]. Coincidentally, in 1996, the Japanese government announced that it will issue a regulation in several years requiring TV news programs to be closed captioned.

Transcribing broadcast news is a challenging task, and thus a good test of applying LVCSR technology to real-world systems. We are therefore investigating the automatic recognition of Japanese broadcast-news speech. This paper describes some of our preliminary results.

We have been working on LVCSR for read newspaper speech. So far, a word accuracy of about 90% has been achieved for a 7k-word vocabulary [6-8].

Figure 1 shows the progress of our LVCSR performance for newspaper speech recognition. We found that bigram and trigram language models are very effective for Japanese LVCSR. Our trigram language model reduced the word error rate from 18.1% to 10.1%. This improvement is much larger than those for other languages.


Fig. 1 LVCSR experimental results for newspaper speech.
CI: context-independent acoustic models were used,
CD: context-dependent acoustic models were used,
NG: no grammar models were used,
BG: bigram language models were used,
TG: trigram language models were used,
ENGY: energy parameters were added to the feature parameters.

We have applied our LVCSR system to transcribing broadcast-news speech.

We extended the vocabulary to 20k words and trained the language models using newspaper texts and broadcast-news manuscripts. We conducted phoneme-recognition experiments to examine if broadcast-news speech is acoustically more difficult than read newspaper speech. Then we experimentally compared two language models: one trained using broadcast-news manuscripts and one trained using newspaper texts.

2. BROADCAST NEWS DATA

Raw audio recordings of broadcast news include frequent speaker changes, background music, and telephone speech, such as field reports. We segmented the recordings manually and used only the clean-speech parts, i.e., those parts not containing background music, noise, or telephone speech, for the experiments reported here. Even clean speech is challenging to recognize because news speech is usually much more fluent than read speech. Furthermore, we found that the sentences are much longer for broadcast news than for newspapers. As shown in Fig. 2, the average number of words per sentence in broadcast-news manuscripts is almost double that in newspaper texts.


Fig. 2 Histogram of the number of words per sentence

To apply n-gram language models, we segmented the broadcast-news manuscripts into words by using a morphological analyzer because Japanese sentences are written without spacing between words. Irrelevant symbols, such as bullets, were filtered out. The details of this text filtering process are described in [6, 7]. A word-frequency list was derived from the filtered sentences, and the 20k most frequently used words were selected as the vocabulary. This 20k-word vocabulary covers about 98% of the words in the broadcast-news manuscripts. Table 1 lists the training-text size and the coverage for broadcast news (News), the Nikkei newspaper (Nikkei), and the Wall Street Journal (WSJ). The training-text size is noticeably smaller for the broadcast news than for the newspapers.
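The vocabulary selection and coverage computation described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names and the toy token list are hypothetical, and the input is assumed to be already segmented into words (in the paper, by a morphological analyzer).

```python
from collections import Counter

def select_vocabulary(tokens, size):
    """Return the `size` most frequent words in the training tokens."""
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(size)]

def coverage(tokens, vocab):
    """Fraction of running words (tokens) that are in the vocabulary."""
    vocab_set = set(vocab)
    covered = sum(1 for t in tokens if t in vocab_set)
    return covered / len(tokens)

# Toy, already-segmented text (hypothetical data for illustration only).
tokens = "a b a c a b d a b e".split()
vocab = select_vocabulary(tokens, 2)   # the two most frequent words
cov = coverage(tokens, vocab)          # 7 of the 10 tokens are covered
print(vocab, cov)
```

With a real corpus, `size` would be 20000, and the coverage figures in Table 1 correspond to evaluating `coverage` for several vocabulary sizes.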


                             News     Nikkei   WSJ
Training text size (words)   24M      180M     237M
Number of distinct words     114k     623k     476k
5k coverage                  91.5%    88.0%    90.6%
7k coverage                  -        90.3%    -
20k coverage                 98.0%    96.2%    97.5%
30k coverage                 -        97.5%    -
65k coverage                 99.7%    99.0%    99.6%
Table 1 Comparison of lexica and LM training corpora

3. ACOUSTIC MODELING

The acoustic models we used were all shared-state context-dependent phoneme HMMs designed using tree-based clustering [9]. The total number of states was 2106, and the number of Gaussian mixture components per state was 4. They were trained using phonetically-balanced sentences and read dialogue speech spoken by 53 speakers. The total number of utterances was 13,270.

To investigate the acoustic difference between broadcast-news speech and read newspaper speech, we conducted phoneme-recognition experiments. Table 2 shows the phoneme recognition results. The percent correct and accuracy were calculated as follows.

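%Correct = ((N - sub. - del.) / N) * 100

Accuracy = ((N - sub. - del. - ins.) / N) * 100

Here N is the number of phonemes in the reference transcription, and sub., del., and ins. are the numbers of substitution, deletion, and insertion errors, respectively.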
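As a small worked example of the two measures, the following sketch evaluates them from error counts. The function names and the counts are hypothetical and chosen only for illustration; they are not the counts behind Table 2.

```python
def percent_correct(n, sub, dele):
    """%Correct: insertions are ignored."""
    return 100.0 * (n - sub - dele) / n

def accuracy(n, sub, dele, ins):
    """Accuracy: insertions are also counted as errors."""
    return 100.0 * (n - sub - dele - ins) / n

# Hypothetical counts: 1000 reference phonemes,
# 120 substitutions, 60 deletions, 200 insertions.
n, sub, dele, ins = 1000, 120, 60, 200
pc = percent_correct(n, sub, dele)
acc = accuracy(n, sub, dele, ins)
print(pc, acc)
```

Because the two formulas differ only in the insertion term, the gap between %Correct and Accuracy equals the insertion rate, ins./N, expressed as a percentage.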


             News     Nikkei
% Correct    82.0%    80.7%
Accuracy     61.5%    64.7%
Table 2 Phoneme recognition for news speech and read newspaper speech

The accuracy was almost the same for the broadcast-news speech (61.5%) and the read newspaper speech (64.7%). Therefore, it can be said that acoustic models trained using read speech are applicable to news speech LVCSR.

4. LANGUAGE MODELING

N-gram language models have shown significant effectiveness in Japanese LVCSR for read newspaper speech [6-8]. We can expect the same effectiveness for news speech transcription. To train n-gram language models we need a large amount of text data. As Table 1 shows, it is usually easier to collect a large amount of data for newspaper texts than for broadcast-news manuscripts. Therefore, it would be helpful if a newspaper language model also worked well for broadcast news.

To determine if a newspaper language model can be used for broadcast news, we trained two language models, one using broadcast-news manuscripts and one using newspaper texts. Table 3 shows the number of distinct unigrams and bigrams and the average occurrence of n-gram models in the training texts. The number of distinct bigrams for broadcast news was much smaller than that for the Nikkei newspaper due to the small training-text size.


                   n-gram    Distinct no.   Av. occurrence
Broadcast news     unigram   20k            1160
                   bigram    0.9M           24
Nikkei newspaper   unigram   20k            8747
                   bigram    3.6M           44
Table 3 Number and average occurrence of distinct n-grams
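The two statistics in Table 3 can be computed as in the sketch below, under the assumption that "average occurrence" means the total number of n-gram tokens divided by the number of distinct n-grams. The paper does not state how sentence boundaries or out-of-vocabulary words were treated, so this sketch simply counts n-grams within each sentence; the function name and toy data are hypothetical.

```python
from collections import Counter

def ngram_stats(sentences, n):
    """Return (number of distinct n-grams, average occurrences per distinct n-gram)."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    distinct = len(counts)
    if distinct == 0:
        return 0, 0.0
    return distinct, sum(counts.values()) / distinct

# Toy segmented sentences (hypothetical data).
sentences = [["a", "b", "a", "b"], ["a", "c"]]
uni = ngram_stats(sentences, 1)   # 3 distinct unigrams over 6 tokens
bi = ngram_stats(sentences, 2)    # 3 distinct bigrams over 4 bigram tokens
print(uni, bi)
```

A low average occurrence, as for the broadcast-news bigrams, indicates that many bigram probabilities are estimated from few observations.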

The language models used in the LVCSR experiments described in the next section were basically bigram models unless otherwise described. The bigram models were smoothed using Katz's smoothing method [10].
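The back-off structure of such a smoothed bigram model can be illustrated with the sketch below. Note that it is a simplification and not Katz's method itself: Katz's method derives its discounts from Good-Turing estimates, whereas this sketch subtracts a constant discount from every observed bigram count and redistributes the freed probability mass over unseen successors in proportion to their unigram probabilities. The class name, the discount value, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

class BackoffBigram:
    """Bigram model with back-off to unigrams (constant discount, not Good-Turing)."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount            # must be smaller than the smallest count (1)
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for words in sentences:
            self.unigrams.update(words)
            for prev, word in zip(words, words[1:]):
                self.bigrams[prev][word] += 1
        self.total = sum(self.unigrams.values())

    def p_unigram(self, word):
        return self.unigrams[word] / self.total

    def prob(self, word, prev):
        followers = self.bigrams.get(prev)
        if not followers:
            # History never seen: use the unigram distribution directly.
            return self.p_unigram(word)
        hist_count = sum(followers.values())
        if word in followers:
            # Seen bigram: discounted relative frequency.
            return (followers[word] - self.discount) / hist_count
        # Unseen bigram: share the freed mass in proportion to unigram probability.
        freed = self.discount * len(followers) / hist_count
        unseen_mass = 1.0 - sum(self.p_unigram(w) for w in followers)
        if unseen_mass <= 0.0:
            return 0.0                      # every known word already follows `prev`
        return freed * self.p_unigram(word) / unseen_mass

# Toy segmented sentences (hypothetical data).
sentences = [["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d"]]
lm = BackoffBigram(sentences)
vocab = sorted(lm.unigrams)
print({w: round(lm.prob(w, "a"), 4) for w in vocab})
```

For each history in this toy example, the probabilities over the vocabulary sum to one, which is the property the back-off weight is designed to preserve.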

5. LVCSR EXPERIMENTS

The evaluation speech sets are summarized in Table 4. The news-speech data set was divided into two parts: one for anchor speakers and one for other speakers. For comparison, we also used a read newspaper speech set that had a 30k-vocabulary.


                    Anchor   Others   Nikkei
No. of speakers     5        6        10
No. of utterances   100      125      100
No. of words        4184     2285     2168
OOV rate            0.9%     3.7%     3.5%
Table 4 Evaluation speech

The LVCSR results are shown along with the test-set perplexities in Table 5. The two language models (LMs) were applied to each evaluation speech set. The News LM achieved better results for news speech (Anchor and Others) than the Nikkei LM, which achieved better results for read newspaper speech (Nikkei). To investigate why the News LM showed poor performance for Others, we plotted word accuracy for each speaker against test-set perplexity (Figs. 3 and 4). We found that word accuracy depends on test-set perplexity, not on the type of speaker (see Fig. 4).
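Test-set perplexity is the inverse geometric mean of the probabilities that the language model assigns to the words of the test set. The sketch below computes it from a list of per-word probabilities; the function name and example values are hypothetical, and the paper does not state how out-of-vocabulary words or sentence boundaries were handled in its perplexity figures.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-(1/N) * sum of log p) over the N test-set words."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Hypothetical per-word probabilities assigned by some language model.
pp_small = perplexity([0.5, 0.125])        # geometric mean is 0.25
pp_uniform = perplexity([1.0 / 105] * 50)  # uniform choice among 105 words
print(pp_small, pp_uniform)
```

A perplexity of 105, as for the News LM on the Anchor set, therefore corresponds to the model being, on average, as uncertain as a uniform choice among 105 words.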


Task           Language model   Test-set perplexity   Word accuracy (%)
News(Anchor)   News LM          105                   76.3
               Nikkei LM        190                   68.5
News(Others)   News LM          255                   61.8
               Nikkei LM        281                   59.8
Nikkei(30k)    News LM          253                   69.0
               Nikkei LM        100                   77.2
Table 5 LVCSR results


Fig. 3 LVCSR results


Fig. 4 Perplexity vs. word accuracy

Since we had few broadcast-news manuscripts, we trained a trigram model using newspaper texts rather than broadcast-news manuscripts and applied it to broadcast-news speech transcription. The word accuracy improved from 76.3% to 79.3% for Anchor speakers' speech. This result suggests that word trigram models will be reasonably effective for broadcast-news speech recognition if there is a large amount of text data from the same task domain.

6. CONCLUSION

In preliminary experiments on Japanese broadcast-news transcription, we have achieved a word accuracy of 76.3% for anchor-speakers' speech by using a bigram language model trained using broadcast-news manuscripts.

A trigram language model trained using newspaper texts improved the word accuracy to 79.3%. Phoneme-recognition experiments showed that acoustic models trained using read speech are applicable to broadcast-news speech. A newspaper language model was not as effective as a broadcast-news language model for broadcast-news transcription. A language model interpolation or adaptation method is definitely needed.

We are currently working on a language model adaptation method that will enable us to use a large number of newspaper texts for training a language model for broadcast-news transcription. We are also incorporating trigram language models into our broadcast-news transcription system.

REFERENCES

1. F. Kubala, T. Anastasakos, H. Jin, J. Makhoul, L. Nguyen, R. Schwartz, and N. Yuan, "Toward automatic recognition of broadcast news," Proc. DARPA Speech Recognition Workshop, pp. 55-60, February 1996

2. U. Jain, M. A. Siegler, S. J. Doh, E. Gouvea, J. Huerta, P. J. Moreno, B. Raj, and R. M. Stern, "Recognition of continuous broadcast news with multiple unknown speakers and environments," Proc. DARPA Speech Recognition Workshop, pp. 61-66, February 1996

3. S. Wegmann, et al., "Marketplace recognition using Dragon's continuous speech recognition system," Proc. DARPA Speech Recognition Workshop, pp. 67-71, February 1996

4. P. S. Gopalakrishnan, R. Gopinath, S. Maes, M. Padmanabhan, L. Polymenakos, H. Printz, and M. Franz, "Transcription of radio broadcast news with the IBM large vocabulary speech recognition system," Proc. DARPA Speech Recognition Workshop, pp. 72-76, February 1996

5. F. Kubala, T. Anastasakos, H. Jin, L. Nguyen, and R. Schwartz, "Transcribing radio news," ICSLP-96, pp. 598-601, October 1996

6. T. Matsuoka, K. Ohtsuki, T. Mori, S. Furui, and K. Shirai, "Large-vocabulary continuous speech recognition using a Japanese business newspaper (Nikkei)," Proc. DARPA Speech Recognition Workshop, pp. 137-142, February 1996

7. T. Matsuoka, K. Ohtsuki, T. Mori, S. Furui, and K. Shirai, "Japanese large-vocabulary continuous speech recognition using a business-newspaper corpus," ICSLP-96, pp. 22-25, October 1996

8. T. Matsuoka, K. Ohtsuki, T. Mori, K. Yoshida, S. Furui, and K. Shirai, "Japanese large-vocabulary continuous speech recognition using a business-newspaper corpus," ICASSP-97, to appear, April 1997

9. S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Human Language Technology Workshop, pp. 307-312, March 1994

10. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-35, no. 3, pp. 400-401, March 1987