The broadcast news benchmark tests have potential as a source of ideas for improving continuous speech recognition systems. This paper presents a data analysis method for uncovering such ideas and applies the method to the 1996 and 1997 DARPA CSR Hub-4 results. The method is based on a latent variables model instead of a more familiar regression model. The method identifies certain portions of the test material that result in wide performance differences among systems. Such portions, because some systems could handle them and others could not, are worth thinking about in terms of what system features lead to the performance differences. Identification of specific system differences that are responsible for performance differences may lead to system improvements.
Benchmark tests of continuous speech recognition systems usually entail having each system transcribe the same selection of speech. In the case considered here, the selection is from broadcast news. The system transcriptions are compared to an assumed perfect transcription and system transcription errors identified. System-to-system comparisons of these errors can be used to choose the best system from among those tested. In addition, such comparisons can be used to gain insights into the effects of system differences on performance. Such insights are the goal of the method presented in this paper.
The program of data interpretation we have in mind involves description of differences in system transcription errors, description of differences in system features, and finally development of relations between these two descriptions. We focus on a method for error description because our knowledge of system features is insufficient. Hopefully, those who have in-depth system knowledge will be able to use either the error description given here or an error description obtained with the method given here to gain system insights.
Fundamental to the development of speech recognition systems is comparison of alternatives in terms of observed word error rates. What alternatives should a developer compare? This paper provides help in answering this question, help in finding alternatives that lead to major system improvements. This help can come in the form of segments of speech or speakers to which a developer can listen. This help can also come in the form of segment categories that a developer can study. The segments, speakers, or segment categories provided are the most important for a developer to consider because they are portions of speech that some systems transcribe better than others. Thus, these portions of speech point to improvement that can be made in at least some systems. A developer that pays attention to these portions avoids wasting time on portions of speech that any system can easily transcribe or that no system can transcribe well.
The analysis method presented here is not based on regression, that is, modeling system performance in terms of (potentially) manifest variables such as signal-to-noise ratio, speaker gender, or out-of-vocabulary words [1]. Rather, the analysis method is based on latent variables, which are not given as values for each segment in the data set [2]. Generally, the method consists of looking for latent variables that have a large effect on performance differences among systems. After such variables have been identified, they can be characterized in ways helpful in pinpointing subsystems in need of improvement.
We begin the analysis with system word error rates for partitioned segments. Segments are turns in speaking by a single speaker, and partitioned segments have been further divided at points where the speech condition, the so-called focus condition, changes. The word error rates and segment characteristics were derived through adjudicating differences among three transcribers and annotators. Think of the word error rates for systems and segments as a two-way table with rows corresponding to systems and columns to segments. The method presented analyzes this table, that is, decomposes the table into components that when added back together give the original word error rates. The decomposition involves a component for overall system performance and a component for segment difficulty. Further, there is a component that reflects the degree to which segment error rates are closely or not so closely aligned with overall system performance. Beyond the overall system performance, there may be other differences in system performance that appear when a subset of the segments is considered but are offset by other segments in the overall performance. Such differences can be thought of as the result of latent variables. Finally, what remains is what looks like random variation. The components in the decomposition provide a starting point for an effort to build relations between system features and performance. Such an effort may be time consuming. Thus, in performing this decomposition, there is a need to identify those components distinct enough to be worthy of careful examination.
There are various tables to which our analysis can be applied. Section 2 contains the analysis of the system by segment tables for 1997 and 1996. Instead of segments, we can form tables of speaker word error rates or focus condition word error rates and apply our analysis. As discussed in Sections 3 and 4, collapse of the columns to speakers or focus conditions can make some components more distinct and thus more clearly worthy of consideration. In Section 5, we discuss a somewhat different topic, inference from the 1997 results. We discuss the question of how the systems would compare if they were used to transcribe a much larger collection of speech.
Consider the 1997 test results for the focus conditions F0 (Baseline Broadcast Speech) and F1 (Spontaneous Broadcast Speech), the two focus conditions agreed on as the emphasis for 1997. The system word error rates for these two focus conditions combined are given in the second column of Table 1. We have omitted the results from OGI because their word error rate for F0 and F1 is 31.6, which differs so much from the other results that the OGI results might dominate the analysis.
Table 1. 1997 Results for F0 and F1 Combined.
| System | Word Error Rate (WER) | Centered WER |
|---|---|---|
| bbn | 13.5 | -1.6 |
| cmu | 17.1 | 2.0 |
| cu-con | 18.7 | 3.7 |
| cu-htk | 11.7 | -3.4 |
| dragon | 16.8 | 1.7 |
| ibm | 12.7 | -2.3 |
| limsi | 13.2 | -1.9 |
| philips | 16.7 | 1.7 |
| sri | 15.2 | 0.1 |
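The system word error rates in Table 1 can be viewed as weighted averages of the word error rates for system and segment. Letting the index $i$ denote the system and the index $j$ the segment, and writing $y_{ij}$ for the word error rate of system $i$ on segment $j$ and $n_j$ for the number of reference words in segment $j$, the word error rate for system $i$ is

$$\bar{y}_{i} = \frac{\sum_j n_j \, y_{ij}}{\sum_j n_j}.$$

Note that, for each system, this equation is the ratio of the total number of errors for all segments divided by the total number of words for all segments. The third column in Table 1, labeled Centered WER, is the system word error rate minus the average of the word error rates over the $I$ systems in the table,

$$a_i = \bar{y}_{i} - \frac{1}{I} \sum_{i'=1}^{I} \bar{y}_{i'}.$$
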
In the analysis presented here, these centered word error rates portray the overall ability of each system.
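The pooling and centering just described can be sketched as follows. This is a minimal illustration rather than the scoring software used for the benchmark: the function names `system_wer` and `centered` and the small table of error counts are hypothetical, and only the last lines use figures from the paper (the WER column of Table 1).

```python
import numpy as np

def system_wer(errors, words):
    """Pooled WER in percent for each system: total errors / total reference words."""
    errors = np.asarray(errors, dtype=float)   # shape (systems, segments)
    words = np.asarray(words, dtype=float)     # shape (segments,)
    return 100.0 * errors.sum(axis=1) / words.sum()

def centered(wer):
    """Subtract the unweighted mean over systems from each system WER."""
    wer = np.asarray(wer, dtype=float)
    return wer - wer.mean()

# Hypothetical error counts for 2 systems on 3 segments.
errors = np.array([[2, 10, 4],
                   [5, 12, 3]])
words = np.array([50, 100, 50])

pooled = system_wer(errors, words)

# The pooled rate equals the word-weighted average of the segment rates.
segment_wer = 100.0 * errors / words
weighted = (segment_wer * words).sum(axis=1) / words.sum()

# Centering applied to the WER column of Table 1.
table1_wer = [13.5, 17.1, 18.7, 11.7, 16.8, 12.7, 13.2, 16.7, 15.2]
table1_centered = centered(table1_wer)
print(np.round(table1_centered, 1))
```

Because the Table 1 input is already rounded to one decimal, the centered values obtained this way agree with the published Centered WER column only to within rounding.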
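To complete what might be thought of as the simplest possible analysis of the system by segment word error rates, we include a term representing segment difficulty. For simplicity, we estimate this term without centering; we take as the estimate the average over the $I$ systems of the word error rates $y_{ij}$ for segment $j$,

$$b_j = \frac{1}{I} \sum_{i=1}^{I} y_{ij}.$$

The model

$$y_{ij} = a_i + b_j + e_{ij},$$

where $a_i$ is the centered word error rate of system $i$ and $e_{ij}$ is the remainder, can be extended in several ways.
One extension is addition of a term that describes variation from segment to segment in the effect of the overall system abilities [2]. This extension leads to the model

$$y_{ij} = a_i + b_j + c_j a_i + e_{ij}.$$
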
One can think of the

Note that this estimate is just a linear regression of adjusted segment word error rates on the overall system abilities
Characterization of the segments that have the highest values of

Whatever the values of

We can compare the former to the latter to see if the variation in
One might expect that deviation from the assumption that the variance of
Plotting
In some cases, adding another term to the model of
![]()
In order to distinguish this new term from the ones considered already, we require that

and that the vector

The interpretation of this new term has an analogy in intelligence testing where one first considers general intelligence and then verbal ability versus mathematical ability. Similarly, this new term portrays some segments that some systems do well with and other segments that other systems do well with.
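The full decomposition can be sketched numerically. The code below is only an illustration under stated assumptions: it takes the model to have the form $y_{ij} = a_i + b_j + c_j a_i + \theta u_i v_j + e_{ij}$, fits every term by unweighted least squares, and obtains the contrast term from the leading singular vectors of the residual table. The function name `decompose` and the random data are hypothetical, and any weighting by segment length in the original analysis is not reproduced.

```python
import numpy as np

def decompose(y, words):
    """Split a systems-by-segments WER table into additive components."""
    y = np.asarray(y, dtype=float)           # shape (I systems, J segments)
    words = np.asarray(words, dtype=float)   # reference words per segment, shape (J,)

    pooled = (y * words).sum(axis=1) / words.sum()
    a = pooled - pooled.mean()               # overall system ability (centered WER)
    b = y.mean(axis=0)                       # segment difficulty, not centered

    adjusted = y - a[:, None] - b[None, :]   # what the additive model leaves over
    c = (a @ adjusted) / (a @ a)             # per-segment regression slope on a
    resid = adjusted - np.outer(a, c)

    # Leading singular triple gives the system contrast and segment loadings.
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    return a, b, c, resid, U, s, Vt

# Hypothetical table: 5 systems, 8 segments.
rng = np.random.default_rng(0)
y = 15.0 + rng.normal(0.0, 3.0, size=(5, 8))
words = rng.integers(50, 500, size=8)

a, b, c, resid, U, s, Vt = decompose(y, words)
contrast = s[0] * U[:, 0]    # one value per system (sign is arbitrary)
loading = Vt[0]              # one value per segment

# The components add back to the original table.
rebuilt = a[:, None] + b[None, :] + np.outer(a, c) + (U * s) @ Vt
print(np.allclose(rebuilt, y))
```

With these estimates every column of the residual table sums to zero and is orthogonal to the vector of system abilities, so the contrast vector inherits both properties and the residual table has at most $I - 2$ non-zero singular values. A first singular value that stands well apart from the rest is the signal that the contrast is worth examining.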
We estimate the new term by applying the singular value decomposition to the matrix with elements
![]()
The rank of this matrix is the smaller of
![]()
where the columns of the matrix

and

For the 1997 F0-F1 data that we have considered so far in this paper, the non-zero values of
Consider now the 1996 test results for the focus conditions F0 and F1, results which were discussed last year by Pallett, et al. [3]. The scoring we consider is the one used last year, not the somewhat revised scoring used for the 1997 results. The system word error rates for these two focus conditions, the values of
Table 2. 1996 Results for F0 and F1 Combined.
| System | WER | Centered WER | Contrast | Segment by Dole |
|---|---|---|---|---|
| bbn | 30.2 | -1.0 | -1.4 | 22.5 |
| cmu | 33.8 | 2.5 | 7.4 | 60.5 |
| cu-con | 34.3 | 3.1 | -2.8 | 33.2 |
| cu-htk | 27.4 | -3.9 | 1.0 | 17.8 |
| ibm | 30.7 | -0.6 | -1.7 | 22.1 |
| limsi | 28.0 | -3.2 | 0.1 | 15.8 |
| sri | 34.4 | 3.1 | -2.5 | 24.5 |
As above, we first consider
The non-zero values of
Tables 1 and 2 show that the average word error rate for 1997 is 15.1 and for 1996 is 27.1. Thus, from 1996 to 1997, there is a 44 percent (relative) decrease in the word error rate. One might wonder how much of this decrease is due to easier material and how much to system improvement. One might conjecture that easier material would correspond to easier speakers and therefore that one should consider speakers that appear in both years. Table 3 shows five speakers, David Brancaccio, Donna Kelly, Leon Harris, President Clinton, and Senator Dole, that are included in both test sets. Along with each speaker is the median of the relative decrease in system word error rate, the median over the seven systems that participated in both years.
Table 3. Speakers Common to Both Years
| Speaker | Median Relative Decrease |
|---|---|
| Brancaccio | 37% |
| Kelly | 54% |
| Harris | 48% |
| Clinton | 5% |
| Dole | 9% |
We see that the first three speakers show a relative decrease in word error rate comparable to the decrease in the average for all speakers and systems. Assuming that the sites did not anticipate these particular speakers, the results for these speakers suggest that the 1997 decrease in word error rate is largely due to system improvement. The last two speakers do not show a relative decrease as large. Why did the systems improve less for these last two speakers? One possible explanation is the change in source for these speakers from 1996 to 1997. In 1996, all the material was from news broadcasts. In 1997, the contribution of these speakers was a portion of the CSPAN archive of the presidential debates. Further investigation might show this change in source to be responsible for the apparent discrepancy between the five speakers. On the other hand, it may be true that the separate effects of easier material and system improvement on the word error rate cannot be resolved.
It is well known that performance varies with speaker and that for a particular speaker, performance differs between speech that is previously prepared material being read and speech that is spontaneously formed. In addition, there are other variations in speech that can cause performance for a particular speaker to change. These include rate of speech, grammatical complexity of the material, and use of out-of-vocabulary words. For broadcast news, within a focus condition, it seems plausible that speaker-to-speaker variation is generally much larger than other segment-to-segment variation. For this reason, one might consider applying the foregoing analysis to tables of word error rates by system and speaker.
The word error rates for a particular speaker are obtained from the word error rates for segments spoken by that speaker. The speaker word error rate can be viewed as a weighted average of the segment word error rates with weighting given by the number of reference words in the segment. This gives the word error rate for a system-speaker category as the number of errors for the category divided by the number of reference words in the category. For this reason, parts of the analysis do not change as one goes from segments to speakers. The system word error rates and the
We analyze the 1997 system by speaker word error rates separately for focus conditions F0 and F1. The results for F0 are given in Table 4.
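The pooling of segments into speakers described above can be sketched as follows. The function name `speaker_wer`, the record layout, and the counts are hypothetical; the point is only that summing errors and reference words over a speaker's segments gives the same rate as the word-weighted average of that speaker's segment rates.

```python
from collections import defaultdict

def speaker_wer(segments):
    """Pool (speaker, errors, reference_words) records for one system by speaker.

    Returns a dict mapping speaker to WER in percent.
    """
    totals = defaultdict(lambda: [0, 0])   # speaker -> [errors, words]
    for speaker, errors, words in segments:
        totals[speaker][0] += errors
        totals[speaker][1] += words
    return {spk: 100.0 * e / w for spk, (e, w) in totals.items()}

# Hypothetical segments for one system: speaker A has two segments.
segments = [("A", 5, 100), ("B", 12, 60), ("A", 1, 100)]
rates = speaker_wer(segments)

# Word-weighted average of A's segment rates (5% and 1%, 100 words each).
a_weighted = (100 * 5.0 + 100 * 1.0) / (100 + 100)
print(rates, a_weighted)
```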
Table 4. 1997 Results for F0.
| System | Word Error Rate | Centered WER | Contrast |
|---|---|---|---|
| bbn | 11.4 | -1.2 | -1.0 |
| cmu | 14.4 | 1.8 | 2.2 |
| cu-con | 15.5 | 2.8 | -0.5 |
| cu-htk | 9.9 | -2.8 | -1.2 |
| dragon | 13.9 | 1.2 | 0.2 |
| ibm | 10.3 | -2.4 | 2.8 |
| limsi | 11.6 | -1.1 | -0.7 |
| philips | 14.4 | 1.8 | -1.0 |
| sri | 12.5 | -0.1 | -0.7 |
We see that the system word error rates are not much different from those shown in Table 1 for F0 and F1 combined. There are 49 speakers. The F ratio for the values of
Because the first term in the singular value decomposition stands out (the non-zero values of
Table 5. Speakers Associated With the 1997 F0 Contrast
| Speaker | Number of Words | Loading |
|---|---|---|
| Hollings | 107 | 7.07 |
| Kennedy | 169 | 3.59 |
| Harris | 896 | 0.92 |
| Dole | 326 | 1.27 |
| Clinton | 282 | -1.34 |
| Moret | 47 | -3.22 |
Compared to other speakers, the systems from CMU and IBM did poorly with the first four of these speakers and did well with the last two. We note that as in 1996, the CMU system had trouble with Senator Dole in 1997.
The 1997 system results for F1 are given in Table 6.
Table 6. 1997 Results for F1.
| System | Word Error Rate | Centered WER | Contrast |
|---|---|---|---|
| bbn | 19.1 | -2.5 | -2.4 |
| cmu | 24.2 | 2.6 | -1.3 |
| cu-con | 27.5 | 5.9 | -1.5 |
| cu-htk | 16.5 | -5.0 | 1.2 |
| dragon | 24.6 | 3.1 | 5.9 |
| ibm | 19.3 | -2.2 | -0.1 |
| limsi | 17.4 | -4.1 | 0.8 |
| philips | 23.0 | 1.4 | -0.8 |
| sri | 22.3 | 0.8 | -1.8 |
We see that the system word error rates are larger than those for F0 shown in Table 4 but that the system-to-system differences are not much different. There are 12 speakers. The F ratio for the
The
In this section, we consider the test data grouped by focus condition and include all seven focus conditions. The corresponding two-way table of word error rates is familiar from last year's presentation. Of interest is the fact that beyond the overall system performance, performance for focus conditions F2 (speech over telephone channels) and FX (all other speech) differentiate some systems from others. This suggests that these systems may have some advantages even though their overall performance is not the best.
The system results for 1997 and all focus conditions are given in Table 7.
Table 7. 1997 System Results for F0-FX.
| System | Word Error Rate | Centered WER | Contrast |
|---|---|---|---|
| bbn | 19.9 | -0.5 | 2.2 |
| cmu | 22.7 | 2.4 | -0.6 |
| cu-con | 25.1 | 4.7 | 0.1 |
| cu-htk | 15.8 | -4.6 | -0.7 |
| dragon | 22.3 | 2.0 | 0.0 |
| ibm | 17.4 | -3.0 | 0.9 |
| limsi | 17.8 | -2.5 | -0.6 |
| philips | 22.5 | 2.1 | -0.3 |
| sri | 19.8 | -0.6 | -1.1 |
The corresponding focus condition results are given in Table 8.
Table 8. 1997 Focus Conditions Results for F0-FX.
| Focus Condition | Difficulty | Regression | Loading |
|---|---|---|---|
| F0 | 12.6 | -0.36 | -0.42 |
| F1 | 21.5 | 0.16 | -0.48 |
| F2 | 27.1 | 0.63 | 1.95 |
| F3 | 28.9 | -0.04 | -1.17 |
| F4 | 24.0 | 0.21 | -0.50 |
| F5 | 25.5 | -0.51 | -1.17 |
| FX | 37.7 | 0.49 | 1.80 |
The values of
What one cannot see from simple inspection of the table of segment-focus condition word error rates is the
Interestingly, the 1996 data show similar results. The system results for 1996 and all focus conditions are given in Table 9.
Table 9. 1996 System Results for F0-FX Combined.
| System | Word Error Rate | Centered WER | Contrast |
|---|---|---|---|
| bbn | 30.4 | -1.3 | 0.0 |
| cmu | 35.2 | 3.6 | 0.6 |
| cu-con | 34.9 | 3.2 | 0.1 |
| cu-htk | 27.8 | -3.9 | 1.1 |
| ibm | 32.3 | 0.7 | 1.9 |
| limsi | 27.4 | -4.3 | -1.2 |
| sri | 33.5 | 1.9 | -2.5 |
The focus condition results are given in Table 10.
Table 10. 1996 Focus Conditions Results for F0-FX.
| Focus Condition | Difficulty | Regression | Loading |
|---|---|---|---|
| F0 | 23.0 | -0.15 | -0.87 |
| F1 | 30.6 | -0.09 | -0.35 |
| F2 | 34.8 | 0.13 | 1.98 |
| F3 | 28.6 | 0.88 | -0.53 |
| F4 | 37.7 | 0.32 | 0.50 |
| F5 | 31.9 | 0.69 | -0.87 |
| FX | 51.2 | -0.29 | 1.72 |
The most striking difference between 1997 and 1996 is the performance exhibited by all systems. The system rankings, on the other hand, changed only moderately. Instead of comparing the values of
The values of
As an alternative to viewing the broadcast news benchmark tests as a source of ideas for system improvement, one might wonder what the results imply about the superiority of one system over another. By superiority, one would usually mean that one system would perform better than another if applied to a large body of speech that one would be willing to call a population of news broadcasts. A population of news broadcasts might be all the network news shows broadcast during the last 20 years.
The main problem with inferring population performance from the 1997 benchmark test is representativeness. The ten hours of speech from which the 1997 material was selected were themselves arbitrarily selected. As shown by the foregoing analysis, there are a variety of factors that affect comparative system performance. These might be present in different proportions in the ten hours than in the population of interest. Determining how this lack of representativeness affects the comparative performance seen in the 1997 results seems difficult.
If we were to regard the 1997 test data as a random sample from some population, then we could perform a hypothesis test to see whether, in terms of this population, two systems are significantly different in their performance. The question is whether we should regard the test data as a random sample of segments, or a random sample of speakers. A random sample of speakers is a better choice for the following reason. If we were to select a random sample of segments from a population, then we would obtain many more speakers than in the 1997 test data and thus a greater variety of speakers. Since speaker is an important determinant of performance, the segments in the 1997 test data exhibit a statistical dependence that would make the usual estimate of variance invalid and thus a hypothesis test invalid.
To test the difference in word error rates between two systems, one must realize that this difference is a ratio estimate. The numerator is the difference in total errors between the two systems and the denominator is the total number of words in the test set. In the context of random sampling, the numerator and denominator are both random variables. This case is treated by Cochran [4].
Consider the differences in the word error rate obtained from Table 1. Taking the systems two at a time and using Cochran's formula, we obtain standard deviations for the differences that range between 0.4 and 0.8. Thus, the least significant difference at the 0.05 level ranges from 0.8 to 1.6. In rough terms, differences in word error rate less than 1.2 should be regarded as perhaps only due to the peculiar selection of the 1997 test data.
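The calculation can be sketched as follows, assuming that the formula meant is the usual large-sample variance estimate for a ratio estimator under simple random sampling in Cochran [4], with speakers as the sampling units and the finite population correction ignored. The function name `wer_difference_se` and the per-speaker counts are hypothetical.

```python
import numpy as np

def wer_difference_se(errors_a, errors_b, words):
    """Difference in WER between two systems and its estimated standard deviation.

    Inputs are per-speaker totals. The difference is a ratio estimate:
    (sum of error differences) / (sum of reference words).
    """
    d = np.asarray(errors_a, dtype=float) - np.asarray(errors_b, dtype=float)
    x = np.asarray(words, dtype=float)
    n = d.size

    ratio = d.sum() / x.sum()
    resid = d - ratio * x                        # deviations from the ratio line
    var = (resid @ resid) / (n - 1) / (n * x.mean() ** 2)
    return ratio, np.sqrt(var)

# Hypothetical per-speaker error counts for two systems, and reference words.
errors_a = [10, 30, 15, 5]
errors_b = [12, 24, 18, 10]
words = [100, 200, 150, 50]

diff, se = wer_difference_se(errors_a, errors_b, words)
lsd = 2.0 * se   # rough least significant difference at the 0.05 level
print(100 * diff, 100 * se, 100 * lsd)   # expressed in WER percentage points
```

The factor of 2 mirrors the relation between the quoted standard deviations (0.4 to 0.8) and least significant differences (0.8 to 1.6).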
At this time, no conclusion can be reached on the real value of the foregoing analysis because a careful search for connections with system features has not been done. Will listening to the segments by Senator Dole, Senator Hollings, or Bob Edwards suggest anything to system developers? Will the emergence of focus condition F2 as a type of segment that distinguishes systems suggest anything to developers? It is too early to tell.
The process of system development involves many experiments. In interpreting the results of many of these experiments, researchers may look only at the overall word error rate. The method presented here provides a way to obtain more information from such experiments. This information should speed system development.
1. Fisher, W. M., "Factors Affecting Recognition Error Rate," in Proc. Speech Recognition Workshop, February 18-21, 1996, Arden Conference Center, Harriman, NY.
2. Krishnaiah, P. R. and Yochmowitz, M. G., "Inference on the Structure of Interaction in Two-Way Classification Model," in Handbook of Statistics, Vol. 1, Amsterdam: North-Holland Publishing Company (1980) pp. 973-994.
3. Pallett, D. S., Fiscus, J. G., and Przybocki, M. A., "1996 Preliminary Broadcast News Benchmark Tests," in Proc. Speech Recognition Workshop, February 2-5, 1997, Westfields International Conference Center, Chantilly, VA.
4. Cochran, W. G., Sampling Techniques, New York: John Wiley and Sons (1977).