The lowest word error rate reported this year was 16.2%, compared with last year's lowest word error rate of 27.1%. In part, this apparent improvement is due to the much greater proportion of well-recognized F0 data present in the test set. This, in turn, is due to an effort to "balance" the test pool so that its composition matches that of the training data.
New this year was the completion of tests in languages other than English--Mandarin and Spanish.
Test materials were drawn from a pool of data provided by the Linguistic Data Consortium, comprising ten hours: recordings of 5 television broadcasts from 4 sources, and recordings of 4 radio broadcasts from 3 sources. These materials were supplemented by a seven-hour set of recordings obtained from C-SPAN, which was used to provide "speeches," in this case mostly from candidates for political office. Because interest had been expressed in sampling a diverse range of speeches, the sample selection algorithm, in this case, was limited to selection of fourteen one-minute excerpts, one per speaker, from the ten hours of materials.
Two sites participated in the Spanish language evaluation, CMU and GTE/BBN, and two sites participated in the Mandarin language evaluation, Dragon and IBM.
Figure 1 illustrates the fact that spontaneous speech is more difficult than baseline speech, for all systems.

Because this table is difficult to interpret, Figure 2 presents a rank-ordered representation of the overall error rates, showing the range of reported word error rates from 16.2% to 38.8%. Ovals enclose systems for which the Matched Pair Sentence Segment word error test [2] failed to reject the null hypothesis that there is no performance difference between the systems under test. Thus, the oval associated with the IBM and LIMSI results indicates that the performance difference reported for those systems (17.9% and 18.3%) is not statistically significant. Similarly, the (essentially zero) difference between the SRI and BBN systems is shown to be insignificant. Finally, the differences among the Dragon, Philips, and CMU systems (23.1%, 23.3%, and 23.8%, respectively) are shown to be not significant.
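The kind of paired comparison behind these ovals can be sketched as follows. This is a minimal illustration in Python, not the NIST implementation described in [2]: it assumes that per-segment error counts are already available for both systems on identical segments, it uses a normal approximation for the two-tailed test, and the 0.05 threshold is an assumption. The function name and the sample counts are hypothetical.

```python
import math
from statistics import mean, stdev


def matched_pairs_test(errors_a, errors_b):
    """Two-tailed matched-pairs test on per-segment error counts.

    Returns (z, p). Assumes both lists cover the same segments in the
    same order and that a normal approximation is adequate.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 segments")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mu = mean(diffs)
    sd = stdev(diffs)
    if sd == 0.0:
        # Every per-segment difference is identical.
        if mu == 0:
            return 0.0, 1.0
        return math.copysign(math.inf, mu), 0.0
    z = mu / (sd / math.sqrt(len(diffs)))
    # Two-tailed p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical error counts for ten segments (not data from the evaluation).
system_a = [3, 5, 2, 4, 6, 3, 5, 4, 2, 6]
system_b = [2, 3, 1, 3, 4, 2, 3, 3, 1, 4]
z, p = matched_pairs_test(system_a, system_b)
print(f"z = {z:.2f}, p = {p:.3g}")
print("reject null" if p < 0.05 else "fail to reject null")
```

A pair of systems for which this kind of test fails to reject the null hypothesis would be drawn inside a common oval.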

The error rates obtained for the three individual (human) transcribers--one from the LDC, one from NIST, and one from the NSA--range from 3.3% to 4.8%.
This year, no "contrastive tests" were outlined in the test specification, but three sites submitted contrastive test results. Notable among these were results for a "near real-time" system reported by GTE/BBN, which ran in approximately 6X real-time, vs. ~200X real-time for the primary system, a 97% relative decrease in run time. For this near real-time system, a word error rate of 25.7% was measured, contrasting with 20.3% for the primary system, a 26% relative increase in word error.
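The relative changes quoted for the GTE/BBN systems follow from the usual definition of relative change. The short Python sketch below recomputes them from the rounded figures given in the text; the function name is illustrative only. Because the inputs are rounded, the word error result (about 26.6%) agrees with the reported 26% only to within rounding.

```python
def relative_change(baseline, contrast):
    """Signed fractional change of `contrast` relative to `baseline`."""
    return (contrast - baseline) / baseline


# Run time in multiples of real time: primary (~200X) vs. near real-time (~6X).
run_time = relative_change(200.0, 6.0)
# Word error rate in percent: primary (20.3) vs. near real-time (25.7).
word_error = relative_change(20.3, 25.7)

print(f"run time:   {run_time:+.1%}")    # -97.0%
print(f"word error: {word_error:+.1%}")  # +26.6%
```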
A second set of contrastive test results, involving the use of alternative lattice rescoring methods, was submitted by the Cambridge University HTK group. These studies included replacement of a unigram cache with the NIST ROVER software [3], resulting in a small reduction in word error.
The third set of contrastive results submitted to NIST was from Philips and involved channel and speaker normalization as well as speaker adaptation techniques.
As indicated previously, the test material consisted of a one-hour test set selected from five hours provided by the LDC, using the same selection procedure as for Hub 4 English. In this case, the test materials were drawn from the same sources as the training data. An NSA staff member was made available to NIST to verify the accuracy of the Spanish language transcripts and to annotate the test data so as to conform to the test specification for focus condition analysis.
Considerable variation was observed in the degree of difficulty presented by the sources of test data. Word error rates for the ECO and Univision source materials ranged from 25% to 29% for both participants, in contrast with word error rates ranging from 12% to 16% for the VOA materials. This variation appears to be attributable to both speaking style and rate of speech, since the VOA materials predominantly consist of carefully produced baseline speech.
Figure 3 shows the distribution of materials (word counts) for the three sources (ECO, Univision, and VOA) and for the five focus conditions identified from the annotated test set (F0 - the baseline, F1 - spontaneous, F3 - speech in the presence of music, F4 - speech under degraded acoustic conditions, and FX - speech in combinations of conditions). Note that materials obtained from VOA broadcasts dominate, and of the VOA materials, the principal category is F0.


Figure 5 shows the Mandarin character error rate for each of the sources. Note that there is a marked difference in performance (a higher error rate) for the materials originating from KAZN. This difference is probably associated with a different distribution of materials across focus conditions: KAZN's broadcast format is AM "news radio," with a relatively larger proportion of spontaneous speech and of speech in the presence of background music.

As one of the participants noted [4], the better overall performance on this test set "seems to be due to the much greater proportion of well-recognized F0 data present." Another participant [5] noted that "the 1997 evaluation test is substantially easier than the development test set or the 1996 evaluation."
Some portion of the differences in overall performance is undoubtedly due to the differences in the data selection paradigm used by NIST, especially our efforts to "balance" the test set with respect to the frequency of occurrence of materials in the different focus conditions, relying on the annotations provided by the LDC. Reconciliation of differences increased the percentage of materials in the F0 baseline condition from 35% to 44%, and in the F1 "spontaneous" condition from 15% to 19%, so that 63% of the test set materials ended up classified in the low background noise category. However, the corresponding data for 1996 [6] show that 29.7% of that data was classified as F0 and 32.7% as F1, for 62.4% in all in the low background noise category, almost exactly the same percentage as in 1997. The difference between the two test sets therefore lies in the greater emphasis on the F0 baseline condition: 44% in 1997 vs. 29.7% in 1996.
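The dependence of the overall figure on the focus-condition mix can be made concrete with a small Python sketch. It assumes that the overall word error rate is the word-count-weighted average of the per-condition rates, and it uses the cu-htk1 row of the Hub-4E scoring summary table in this paper. The function name is illustrative, and because the per-condition rates are rounded to one decimal place the result is only approximate.

```python
# (word count, word error rate in %) for each focus-condition column of the
# cu-htk1 row in the Hub-4E scoring summary table.
cu_htk1 = [
    (13197, 9.9),   # baseline broadcast speech
    (6566, 15.4),   # spontaneous broadcast speech
    (4882, 20.1),   # speech over telephone channels
    (1571, 27.9),   # speech in the presence of background music
    (3350, 19.4),   # speech under degraded acoustic conditions
    (669, 24.1),    # speech from non-native speakers
    (2599, 29.9),   # all other speech
]


def pooled_wer(subsets):
    """Word-count-weighted average of per-subset word error rates."""
    total_words = sum(words for words, _ in subsets)
    weighted_errors = sum(words * wer for words, wer in subsets)
    return weighted_errors / total_words


print(sum(words for words, _ in cu_htk1))  # 32834
print(round(pooled_wer(cu_htk1), 1))       # 16.2
```

Under this assumption, any shift of words toward the baseline column, where the error rate is lowest, pulls the overall rate down even if per-condition performance is unchanged.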
[2] Pallett, D.S., Fisher, W.M., and Fiscus, J.G. "Tools for the Analysis of Benchmark Speech Recognition Tests," Proceedings of ICASSP 90, pp. 97-100.
[3] Fiscus, J.G. "A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER)," Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 347-354.
[4] Woodland, P.C., et al. "The 1997 HTK Broadcast News Transcription System," in this Proceedings.
[5] Kubala, F., et al. "The 1997 BYBLOS System Applied to Broadcast News Transcription," in this Proceedings.
[6] Garofolo, J.S., Fiscus, J.G., and Fisher, W.M. "Design and Preparation of the 1996 Hub-4 Broadcast News Benchmark Test Corpora," in Proceedings of the Speech Recognition Workshop, February 2-5, 1997, pp. 15-21.
By System Test Subset Scoring Summary
For the Hub-4E Primary Systems Test

,--------------------------------------------------------------------------------------------------------------------------------------------------------.
|                   |              || 1996 Hub4 Focus Conditions                                                                                         |
|-------------------+--------------++--------------+-------------+-------------+------------------+---------------------+-------------+------------------|
| SYSTEM            | Overall      || Baseline     | Spontaneous | Speech Over | Speech in the    | Speech Under        | Speech from | All other speech |
|                   |              || Broadcast    | Broadcast   | Telephone   | Presence of      | Degraded            | Non-Native  |                  |
|                   |              || Speech       | Speech      | Channels    | Background Music | Acoustic Conditions | Speakers    |                  |
|                   | #Wrd %WE     || #Wrd %WE     | #Wrd %WE    | #Wrd %WE    | #Wrd %WE         | #Wrd %WE            | #Wrd %WE    | #Wrd %WE         |
|========================================================================================================================================================|
|                   | Set/Subset #Words and System Set/Subset Average Word Error Rate                                                                    |
|-------------------+------------------------------------------------------------------------------------------------------------------------------------|
| bbn1.ctm.filt     | [32834] 20.3 || [13197] 11.4 | [6566] 17.8 | [4882] 31.2 | [1571] 28.1      | [3350] 22.1         | [669] 26.9  | [2599] 42.7      |
| cmu1.ctm.filt     | [32834] 23.8 || [13197] 14.4 | [6566] 22.8 | [4882] 31.0 | [1571] 33.9      | [3350] 27.3         | [669] 31.1  | [2599] 48.2      |
| cu-con1.ctm.filt  | [32834] 27.1 || [13197] 15.5 | [6566] 26.3 | [4882] 37.5 | [1571] 35.1      | [3350] 31.2         | [669] 25.7  | [2599] 59.1      |
| cu-htk1.ctm.filt  | [32834] 16.2 || [13197]  9.9 | [6566] 15.4 | [4882] 20.1 | [1571] 27.9      | [3350] 19.4         | [669] 24.1  | [2599] 29.9      |
| dragon1.ctm.filt  | [32834] 23.1 || [13197] 13.9 | [6566] 23.4 | [4882] 31.1 | [1571] 34.9      | [3350] 26.5         | [669] 19.0  | [2599] 43.9      |
| ibm1.ctm.filt     | [32834] 17.9 || [13197] 10.3 | [6566] 17.8 | [4882] 24.9 | [1571] 24.6      | [3350] 20.3         | [669] 18.2  | [2599] 36.3      |
| limsi1.ctm.filt   | [32834] 18.3 || [13197] 11.6 | [6566] 17.0 | [4882] 22.1 | [1571] 27.9      | [3350] 21.9         | [669] 27.1  | [2599] 36.3      |
| ogi1.ctm.filt     | [32834] 38.8 || [13197] 28.6 | [6566] 38.0 | [4882] 52.5 | [1571] 50.0      | [3350] 37.3         | [669] 38.7  | [2599] 62.0      |
| philips1.ctm.filt | [32834] 23.3 || [13197] 14.4 | [6566] 21.7 | [4882] 30.8 | [1571] 34.4      | [3350] 25.7         | [669] 30.9  | [2599] 47.1      |
| sri1.ctm.filt     | [32834] 20.3 || [13197] 12.5 | [6566] 20.5 | [4882] 26.4 | [1571] 32.0      | [3350] 23.1         | [669] 26.8  | [2599] 35.2      |
|========================================================================================================================================================|
|                   | Set/Subset Mean #Words/Speaker and Set/Subset Mean Word Error Rate/Speaker                                                         |
|-------------------+------------------------------------------------------------------------------------------------------------------------------------|
| bbn1.ctm.filt     | [280] 29.8   || [269] 14.6   | [547] 25.2  | [203] 41.0  | [68] 28.4        | [145] 38.3          | [133] 25.1  | [86] 39.4        |
| cmu1.ctm.filt     | [280] 32.7   || [269] 15.7   | [547] 29.9  | [203] 40.0  | [68] 33.5        | [145] 50.5          | [133] 29.3  | [86] 41.1        |
| cu-con1.ctm.filt  | [280] 35.2   || [269] 17.2   | [547] 33.1  | [203] 44.0  | [68] 35.3        | [145] 51.2          | [133] 23.3  | [86] 44.9        |
| cu-htk1.ctm.filt  | [280] 23.8   || [269] 10.6   | [547] 23.2  | [203] 26.7  | [68] 30.1        | [145] 35.7          | [133] 23.0  | [86] 32.9        |
| dragon1.ctm.filt  | [280] 31.2   || [269] 14.9   | [547] 30.4  | [203] 41.1  | [68] 33.6        | [145] 37.2          | [133] 20.8  | [86] 42.7        |
| ibm1.ctm.filt     | [280] 25.6   || [269] 11.9   | [547] 22.2  | [203] 32.7  | [68] 27.1        | [145] 33.4          | [133] 16.0  | [86] 35.1        |
| limsi1.ctm.filt   | [280] 24.4   || [269] 11.9   | [547] 23.6  | [203] 28.5  | [68] 31.2        | [145] 34.4          | [133] 24.7  | [86] 35.2        |
| ogi1.ctm.filt     | [280] 47.5   || [269] 32.4   | [547] 49.5  | [203] 57.5  | [68] 46.7        | [145] 56.1          | [133] 37.4  | [86] 53.7        |
| philips1.ctm.filt | [280] 32.4   || [269] 16.4   | [547] 29.1  | [203] 40.0  | [68] 35.5        | [145] 43.6          | [133] 29.4  | [86] 42.0        |
| sri1.ctm.filt     | [280] 27.3   || [269] 14.7   | [547] 30.4  | [203] 33.4  | [68] 29.6        | [145] 35.4          | [133] 27.6  | [86] 35.4        |
|-------------------+------------------------------------------------------------------------------------------------------------------------------------|
|                   | Associated Standard Deviations                                                                                                     |
|-------------------+------------------------------------------------------------------------------------------------------------------------------------|
| bbn1.ctm.filt     | [410] 23.9   || [288] 12.9   | [729] 13.5  | [308] 23.5  | [65] 21.1        | [170] 31.0          | [76]  8.3   | [76] 25.3        |
| cmu1.ctm.filt     | [410] 29.3   || [288] 15.0   | [729] 15.3  | [308] 27.1  | [65] 25.7        | [170] 43.5          | [76]  8.3   | [76] 23.8        |
| cu-con1.ctm.filt  | [410] 27.6   || [288] 12.1   | [729] 10.8  | [308] 23.9  | [65] 23.4        | [170] 43.6          | [76]  9.2   | [76] 23.7        |
| cu-htk1.ctm.filt  | [410] 20.5   || [288]  9.1   | [729] 18.2  | [308] 19.4  | [65] 21.2        | [170] 29.6          | [76]  4.5   | [76] 21.1        |
| dragon1.ctm.filt  | [410] 22.0   || [288] 11.4   | [729] 16.0  | [308] 21.0  | [65] 20.5        | [170] 24.9          | [76]  7.7   | [76] 20.7        |
| ibm1.ctm.filt     | [410] 22.7   || [288] 14.2   | [729] 10.8  | [308] 20.4  | [65] 15.0        | [170] 26.9          | [76]  8.1   | [76] 23.5        |
| limsi1.ctm.filt   | [410] 20.3   || [288]  9.6   | [729] 16.7  | [308] 19.7  | [65] 22.7        | [170] 27.6          | [76]  8.2   | [76] 24.3        |
| ogi1.ctm.filt     | [410] 23.3   || [288] 20.2   | [729] 16.7  | [308] 23.0  | [65] 21.7        | [170] 26.9          | [76]  8.1   | [76] 22.6        |
| philips1.ctm.filt | [410] 27.4   || [288] 12.5   | [729] 14.5  | [308] 27.2  | [65] 17.3        | [170] 40.6          | [76] 10.4   | [76] 24.1        |
| sri1.ctm.filt     | [410] 21.3   || [288] 11.1   | [729] 17.0  | [308] 22.4  | [65] 21.0        | [170] 29.5          | [76]  4.2   | [76] 22.4        |
|========================================================================================================================================================|
|                   | Set/Subset Median #Words/Speaker and Set/Subset Median Word Error Rate/Speaker                                                     |
|-------------------+------------------------------------------------------------------------------------------------------------------------------------|
| bbn1.ctm.filt     | [129] 21.2   || [174] 10.2   | [182] 23.5  | [61] 31.3   | [46] 22.5        | [64] 25.0           | [129] 25.2  | [73] 33.4        |
| cmu1.ctm.filt     | [129] 24.3   || [174] 11.6   | [182] 23.7  | [61] 31.2   | [46] 28.6        | [64] 37.8           | [129] 30.5  | [73] 36.6        |
| cu-con1.ctm.filt  | [129] 29.6   || [174] 13.4   | [182] 30.4  | [61] 36.3   | [46] 28.6        | [64] 35.8           | [129] 23.2  | [73] 41.8        |
| cu-htk1.ctm.filt  | [129] 17.4   || [174]  8.3   | [182] 15.4  | [61] 22.2   | [46] 22.4        | [64] 31.1           | [129] 23.4  | [73] 27.3        |
| dragon1.ctm.filt  | [129] 25.2   || [174] 12.5   | [182] 23.2  | [61] 33.1   | [46] 29.4        | [64] 32.1           | [129] 17.8  | [73] 42.8        |
| ibm1.ctm.filt     | [129] 19.8   || [174]  7.9   | [182] 18.8  | [61] 25.5   | [46] 24.3        | [64] 28.9           | [129] 16.3  | [73] 25.3        |
| limsi1.ctm.filt   | [129] 16.4   || [174] 10.6   | [182] 16.3  | [61] 21.4   | [46] 24.7        | [64] 26.7           | [129] 25.9  | [73] 29.4        |
| ogi1.ctm.filt     | [129] 45.3   || [174] 24.1   | [182] 46.7  | [61] 54.0   | [46] 47.1        | [64] 50.0           | [129] 36.4  | [73] 52.4        |
| philips1.ctm.filt | [129] 26.3   || [174] 12.6   | [182] 22.6  | [61] 33.1   | [46] 35.0        | [64] 33.3           | [129] 27.9  | [73] 36.4        |
| sri1.ctm.filt     | [129] 21.2   || [174] 11.7   | [182] 27.0  | [61] 25.8   | [46] 26.6        | [64] 24.1           | [129] 28.0  | [73] 31.5        |
`--------------------------------------------------------------------------------------------------------------------------------------------------------'
Composite Report of All Significance Tests
For the Hub-4E Primary Systems Test

Test Name                                                Abbrev.
------------------------------------------------------   -------
Matched Pair Sentence Segment (Word Error)               MP
Signed Paired Comparison (Speaker Word Error Rate (%))   SI
Wilcoxon Signed Rank (Speaker Word Error Rate (%))       WI
McNemar (Sentence Error)                                 MN

,---------------------------------------------------------------------------------------------------------------------------------.
| Test    ||          | bbn1   | cmu1 | cu-con1 | cu-htk1 | dragon1 | ibm1    | limsi1  | ogi1    | philips1 | sri1    || Test    |
| Abbrev. ||          |        |      |         |         |         |         |         |         |          |         || Abbrev. |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || bbn1     |        | bbn1 | bbn1    | cu-htk1 | bbn1    | ibm1    | limsi1  | bbn1    | bbn1     | ~       || MP      |
| SI      ||          |        | bbn1 | bbn1    | cu-htk1 | bbn1    | ibm1    | limsi1  | bbn1    | bbn1     | ~       || SI      |
| WI      ||          |        | bbn1 | bbn1    | cu-htk1 | bbn1    | ibm1    | limsi1  | bbn1    | bbn1     | ~       || WI      |
| MN      ||          |        | bbn1 | bbn1    | cu-htk1 | ~       | ibm1    | limsi1  | bbn1    | ~        | sri1    || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || cmu1     |        |      | cmu1    | cu-htk1 | ~       | ibm1    | limsi1  | cmu1    | ~        | sri1    || MP      |
| SI      ||          |        |      | cmu1    | cu-htk1 | ~       | ibm1    | limsi1  | cmu1    | ~        | sri1    || SI      |
| WI      ||          |        |      | cmu1    | cu-htk1 | ~       | ibm1    | limsi1  | cmu1    | ~        | sri1    || WI      |
| MN      ||          |        |      | ~       | cu-htk1 | dragon1 | ibm1    | limsi1  | cmu1    | philips1 | sri1    || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || cu-con1  |        |      |         | cu-htk1 | dragon1 | ibm1    | limsi1  | cu-con1 | philips1 | sri1    || MP      |
| SI      ||          |        |      |         | cu-htk1 | dragon1 | ibm1    | limsi1  | cu-con1 | philips1 | sri1    || SI      |
| WI      ||          |        |      |         | cu-htk1 | dragon1 | ibm1    | limsi1  | cu-con1 | philips1 | sri1    || WI      |
| MN      ||          |        |      |         | cu-htk1 | dragon1 | ibm1    | limsi1  | cu-con1 | philips1 | sri1    || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || cu-htk1  |        |      |         |         | cu-htk1 | cu-htk1 | cu-htk1 | cu-htk1 | cu-htk1  | cu-htk1 || MP      |
| SI      ||          |        |      |         |         | cu-htk1 | ~       | cu-htk1 | cu-htk1 | cu-htk1  | cu-htk1 || SI      |
| WI      ||          |        |      |         |         | cu-htk1 | cu-htk1 | cu-htk1 | cu-htk1 | cu-htk1  | cu-htk1 || WI      |
| MN      ||          |        |      |         |         | cu-htk1 | ~       | ~       | cu-htk1 | cu-htk1  | ~       || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || dragon1  |        |      |         |         |         | ibm1    | limsi1  | dragon1 | ~        | sri1    || MP      |
| SI      ||          |        |      |         |         |         | ibm1    | limsi1  | dragon1 | ~        | sri1    || SI      |
| WI      ||          |        |      |         |         |         | ibm1    | limsi1  | dragon1 | ~        | sri1    || WI      |
| MN      ||          |        |      |         |         |         | ibm1    | limsi1  | dragon1 | ~        | sri1    || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || ibm1     |        |      |         |         |         |         | ~       | ibm1    | ibm1     | ibm1    || MP      |
| SI      ||          |        |      |         |         |         |         | ~       | ibm1    | ibm1     | ibm1    || SI      |
| WI      ||          |        |      |         |         |         |         | ~       | ibm1    | ibm1     | ibm1    || WI      |
| MN      ||          |        |      |         |         |         |         | ~       | ibm1    | ibm1     | ~       || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || limsi1   |        |      |         |         |         |         |         | limsi1  | limsi1   | limsi1  || MP      |
| SI      ||          |        |      |         |         |         |         |         | limsi1  | limsi1   | limsi1  || SI      |
| WI      ||          |        |      |         |         |         |         |         | limsi1  | limsi1   | limsi1  || WI      |
| MN      ||          |        |      |         |         |         |         |         | limsi1  | limsi1   | ~       || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || ogi1     |        |      |         |         |         |         |         |         | philips1 | sri1    || MP      |
| SI      ||          |        |      |         |         |         |         |         |         | philips1 | sri1    || SI      |
| WI      ||          |        |      |         |         |         |         |         |         | philips1 | sri1    || WI      |
| MN      ||          |        |      |         |         |         |         |         |         | philips1 | sri1    || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || philips1 |        |      |         |         |         |         |         |         |          | sri1    || MP      |
| SI      ||          |        |      |         |         |         |         |         |         |          | sri1    || SI      |
| WI      ||          |        |      |         |         |         |         |         |         |          | sri1    || WI      |
| MN      ||          |        |      |         |         |         |         |         |         |          | sri1    || MN      |
|---------++----------+--------+------+---------+---------+---------+---------+---------+---------+----------+---------++---------|
| MP      || sri1     |        |      |         |         |         |         |         |         |          |         || MP      |
| SI      ||          |        |      |         |         |         |         |         |         |          |         || SI      |
| WI      ||          |        |      |         |         |         |         |         |         |          |         || WI      |
| MN      ||          |        |      |         |         |         |         |         |         |          |         || MN      |
`---------------------------------------------------------------------------------------------------------------------------------'

These significance tests are all two-tailed tests with the null hypothesis that there is no performance difference between the two systems.