Philips Research Laboratories
Weisshausstrasse 2, D-52066
Aachen, Germany
haeb@pfa.research.philips.com
Speech recorded from radio or television broadcasts exhibits large variations with respect to the quality of the microphone or channel, the characteristics of the speaker, and the condition of the background. Recordings range from high-quality studio recordings of an experienced announcer to very noisy telephone interviews from the trading floor of the stock exchange. Robustness is therefore a major issue for a speech recognizer for such a task.
In our system we concentrated on normalization techniques to come up with a robust feature set that is invariant to changes of the environment or the speaker characteristics, and on adaptation techniques. The goal was not only to improve performance but also to obtain a conceptually simple system with one model set for all genders and environments. Another advantage would be that condition and/or gender need not be classified.
We were attracted by the conceptual simplicity of the BBN approach taken in the Hub-4 '96 evaluation [5]. Rather than making condition-specific models they decided to train just a single model set for all focus conditions. This simplified the system enormously and rendered condition classification obsolete, while at the same time maintaining good recognition accuracy. Inspired by this experience we directed our research effort in the same direction. We will show here that one model set not only simplified the system but also yielded better error rate performance, compared to more complex approaches.
An interesting question is related to the use of linear discriminant analysis (LDA). By definition the transformation matrix is (training) data-dependent and therefore potentially a disadvantage in a highly varying environment. We investigated several options on how to train the LDA.
Employing a single model set for all conditions and environments requires effective channel and speaker normalization and adaptation schemes. Normalization algorithms are typically performed in the signal processing front end of the recognizer, though not necessarily. We show how cepstral mean and variance normalization lead to features that are less sensitive to additive noise and linear channel distortions. Vocal tract normalization serves to remove speaker characteristics to an extent that gender-specific modeling becomes unnecessary. MLLR adaptation is then applied on clusters of segments both for the within-word and cross-word models.
The next section presents experimental results for different databases used to train the acoustic models and the LDA transformation matrix. Section 3 describes variance and vocal tract normalization, and in Section 4 we give a short description of the adaptation approach employed.
In the acoustic modeling we employ continuous mixtures of Laplacian densities with a single, globally pooled deviation vector. We use different model sets for within-word and cross-word decoding and apply decision trees in either case for triphone clustering. More on the acoustic modeling can be found in [1].
In last year's Hub-4 evaluation there was no unanimous view of what would be the best training strategy. One option was to train on Wall Street Journal data and then perform supervised adaptation on Hub-4 data, possibly even on each focus condition separately. The other was to train on Hub-4 data directly, again with the option of training either focus-specific models or one general model set for all conditions. In light of the availability of another 50h of broadcast news acoustic training data we revisited this question and investigated several alternatives.
We compared the following scenarios:
The motivation for the first scenario was that with wsj0+1 a large and well transcribed database exists, on which we had already gained a lot of experience in the past. Supervised adaptation was conducted with MAP and MLLR. For MLLR, a separate transformation matrix was used for each allophone.
The second scenario promised the smallest mismatch between training and test data, but possibly at the cost of too little training data per condition, while the third scenario would deliver the simplest system with just one model set for all conditions.
An additional complication resulted from the use of linear discriminant analysis (LDA) in our recognizer [6]. Since the transformation matrix is (training) data-dependent we had to decide on which data to train the matrix. For the experiments reported in Table 1 we used an LDA matrix which had been obtained on the wsj0+1 training data. From past experience we know that a mismatch between the training data used to train the models and the training data used to estimate the LDA transformation can lead to significant performance degradation [7]. Therefore the chosen setup of Table 1 definitely favors the first scenario, where the LDA was trained on the same database as the models.
Table 1: Word error rates in % on Hub-4'96 dev. set (male speakers only) for different training scenarios. Bigram lm, gender-dependent setup, within-word models, no adaptation in recognition, partitioned evaluation.
The clear advantage of training a single model set on all Hub-4 data, as is evident from Table 1, is probably due to the increased amount of acoustic training data compared to last year.
The next question, however, is what effect the LDA transformation has. Training on the Hub-4 data was better even though the LDA matrix had been estimated on the wsj database. A first informal test showed, however, that LDA was still beneficial: using no LDA at all increased the error rate by 5% on the F0 subcorpus. Table 2 compares results for an LDA matrix trained on all Hub-4 data to an LDA matrix trained on wsj0+1 data for training scenario 3. Note that the results for the wsj-LDA are better than in Table 1 due to other changes in the system (among others variance normalization, see Section 3).
Table 2: Word error rates in % on Hub-4'96 dev. set (male speakers only) for different LDA matrices. Bigram lm, gender-dependent setup, within-word models, no adaptation in recognition, partitioned evaluation.
The performance improvement obtained by an LDA matrix trained on Hub-4 data is not large, but it is consistent over most focus conditions. It is interesting to note that the eigenvalues of the LDA trained on wsj data are considerably larger than those of the transformation trained on the Hub-4 data: the largest eigenvalue of the ``wsj LDA'' is 6.95 compared to 4.15 for the ``Hub-4 LDA''. This indicates that the wsj training data are much less noisy, such that the average within-class covariance is smaller than in the Hub-4 case. However, although its eigenvalues are larger, the ``wsj LDA'' performed worse on the Hub-4 test data. This result must be attributed to the ``mismatch'' between the model training database (Hub-4) and the LDA training database (wsj).
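The link between eigenvalues and noise level can be made explicit with the textbook formulation of LDA. The symbols $S_W$ (average within-class covariance), $S_B$ (between-class covariance), $v_i$ and $\lambda_i$ are notation introduced here for illustration, not taken from the system description. LDA solves the generalized eigenvalue problem

\[ S_B v_i = \lambda_i S_W v_i, \qquad \lambda_i = \frac{v_i^{T} S_B v_i}{v_i^{T} S_W v_i}. \]

Each eigenvalue is thus the ratio of between-class to within-class scatter along its direction, so for comparable class separation a smaller within-class covariance yields larger eigenvalues.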
In the acoustic front end of the Philips Continuous-Speech recognizer mel-frequency cepstral coefficients are computed. Although the segmenter delivers information on the bandwidth of the underlying signal [2], be it narrowband telephone speech or wideband speech, one common signal analysis based on the assumption of wideband data was applied to all data. Fifteen cepstral coefficients were computed from a 20-channel filterbank, whose center frequencies are equidistant on a mel scale. The static features, their first-order linear regression coefficients, and the log-energy with its first- and second-order regression coefficients make up the ``preliminary'' feature vector. Then three subsequent preliminary feature vectors are adjoined to a 99-component vector, from which a 35-component feature vector is extracted by LDA.
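The dimension bookkeeping above can be checked with a minimal sketch. The dimensions come from the text; the function names, the sliding-window alignment and the handling of segment edges are assumptions made for illustration only, and the projection matrix stands in for a trained LDA matrix.

```python
# Dimensions taken from the text; all names below are illustrative.
N_CEPSTRA = 15                    # static cepstral coefficients
# statics + first-order regression + log-energy with its 1st/2nd-order regression
PRELIM_DIM = 2 * N_CEPSTRA + 3    # = 33
CONTEXT = 3                       # subsequent preliminary vectors that are adjoined
LDA_DIM = 35                      # size of the final feature vector


def splice(frames, context=CONTEXT):
    """Adjoin `context` consecutive frames into one long vector per position.

    Assumption: a plain sliding window; windows that do not fit at the end
    of the segment are dropped.
    """
    spliced = []
    for t in range(len(frames) - context + 1):
        vec = []
        for frame in frames[t:t + context]:
            vec.extend(frame)
        spliced.append(vec)
    return spliced


def project(vec, matrix):
    """Apply a linear projection; each row of `matrix` is one direction."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]
```

With 33-component preliminary vectors, `splice` yields 3 x 33 = 99 components per frame, and a 35 x 99 matrix passed to `project` reduces these to 35.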
In order to make the feature vector less sensitive to distortions, cepstral mean and variance normalization are applied. It is well known that a constant, though unknown, channel transfer function affects the mean of the cepstral features. Further, it has been observed that additive noise results, among other effects, in a mean shift and a reduction of the variance of the distributions of the cepstral coefficients [4].
The mean and variance normalized feature $\hat{c}_k(t)$ is computed as follows:
\[ \hat{c}_k(t) = \frac{c_k(t) - \hat{\mu}_k}{\hat{\sigma}_k}, \qquad k = 1, \ldots, K, \]
where k is the cepstral index, K being the number of (static) features. $\hat{\mu}_k$ is an estimate of the mean and $\hat{\sigma}_k$ is an estimate of the standard deviation of the input cepstral feature $c_k(t)$. Both mean and variance are computed over a block of frames, in our case over one segment, as delivered by the segmenter. This operation is carried out on all static cepstral coefficients.
The effect of variance normalization is that, irrespective of the dynamic range of the input feature stream, each output feature has unit variance (and power, because of cepstral mean normalization):
\[ \frac{1}{T} \sum_{t=1}^{T} \hat{c}_k^2(t) = 1. \]
T denotes the length of the segment in number of frames. While this normalization is conducted with respect to time for each feature independently, it is easy to see that as a result the variance of each feature vector is unity on average:
\[ \frac{1}{T} \sum_{t=1}^{T} \frac{1}{K} \sum_{k=1}^{K} \hat{c}_k^2(t) = 1. \]
On the Hub-4 development data we observed on average a performance improvement of about 3% due to variance normalization, see Table 3.
Table 3: Effect of variance normalization on word error rates on Hub-4'96 dev. set (male speakers only). Bigram lm, gender-dependent setup, within-word models, no adaptation in recognition, partitioned evaluation.
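The per-segment mean and variance normalization described in this section can be sketched as follows. This is a minimal illustration, not the recognizer's implementation: the function name, the biased (1/T) variance estimate and the small floor on the standard deviation are assumptions that the text does not specify.

```python
import math


def mean_variance_normalize(segment):
    """Normalize each cepstral coefficient to zero mean and unit variance.

    segment: list of T frames, each a list of K static cepstral coefficients.
    Statistics are estimated over the whole segment, one coefficient at a time.
    Assumptions: 1/T variance estimate; standard deviation floored at 1e-10
    to avoid division by zero for constant features.
    """
    T = len(segment)
    K = len(segment[0])
    normalized = [[0.0] * K for _ in range(T)]
    for k in range(K):
        column = [frame[k] for frame in segment]
        mean = sum(column) / T
        var = sum((x - mean) ** 2 for x in column) / T
        std = max(math.sqrt(var), 1e-10)
        for t in range(T):
            normalized[t][k] = (column[t] - mean) / std
    return normalized
```

With the 1/T estimate, the time average of each squared output feature is one, and therefore so is the average over frames of the mean squared component of the feature vector.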
Vocal Tract Normalization (VTN) performs a normalization in the signal space by warping the frequency axis, typically linearly, by a speaker-specific warping factor, see e.g. [8]. The intention is that, after normalization, the influence of differences in vocal tract length across speakers on the computed feature vector is removed to a great extent. We implemented the warping by an appropriate shift of the center frequencies of the mel filter bank. For the warping factor selection we adopted a maximum-likelihood approach similar to [8]: a preliminary transcription of the utterance to be recognized is obtained from a first bigram decoding pass without frequency warping. Then the warping factor is determined which yields the largest likelihood of the test utterance, taking the preliminary transcription as the hypothesized word sequence. The final decoding is then conducted with the frequency axis warped according to this factor.
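The maximum-likelihood selection of the warping factor amounts to a search over candidate factors. The sketch below rests on several assumptions: the candidate grid, the direction of the warp (multiplying the center frequencies), and the `log_likelihood` callable, which stands in for feature extraction with warped filters followed by scoring against the preliminary transcription. None of these details are given in the text.

```python
# Hypothetical candidate grid 0.88 ... 1.12 in steps of 0.02 (an assumption).
DEFAULT_CANDIDATES = [round(0.88 + 0.02 * i, 2) for i in range(13)]


def warp_center_frequencies(center_freqs_hz, alpha):
    """Linearly warp filterbank center frequencies; alpha = 1.0 is no warping.

    Assumption: the warp multiplies each center frequency by alpha and
    ignores any clipping at the band edges.
    """
    return [alpha * f for f in center_freqs_hz]


def select_warping_factor(log_likelihood, candidates=DEFAULT_CANDIDATES):
    """Return the candidate with the largest log-likelihood.

    log_likelihood: hypothetical callable mapping a warping factor to the
    log-likelihood of the utterance given the preliminary transcription.
    """
    return max(candidates, key=log_likelihood)
```

The second decoding pass would then use features computed with the selected factor.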
Vocal tract normalization can be carried out in training and in recognition, and it can be used in a gender-dependent (GD) and in a gender-independent (GI) setup. In order to assess different scenarios we ran a number of experiments on the Wall Street Journal database. Table 4 presents recognition results on the 4 wsj 5k 92/93 dev/eval test sets with training on the wsj0 database.
Table 4: Effects of VTN on the word error rate for gender-dependent (GD) and gender-independent (GI) models. WSJ 5k 92/93 dev/eval test sets, bigram lm.
Note that speaker normalization only in training results in worse error rate performance compared to the baseline system without VTN, in particular in the GI case. Only if VTN is also applied in recognition can a reduction in error rate be achieved.
Although the baseline error rate for a GD setup is slightly better, the results for VTN in training and recognition tend to be better in the GI case. Apparently, VTN is able to remove gender-specific variations from the training data and can thus beneficially exploit the larger training database. This is consistent with the experience of other researchers, e.g. [9]. We concluded that VTN provides a means to overcome the need for gender-dependent acoustic models.
We repeated some of the scenarios on the Hub-4'96 development data, see Table 5, and observed similar trends. Note, however, that the error rate reduction due to VTN was considerably smaller, e.g. 3.3% when using VTN in training and recognition in a gender-independent setup, compared to 11% on wsj. We then decided to use a GI setup with VTN in training and recognition for the Nov'97 evaluation.
Table 5: Word error rates in % on Hub-4'96 dev. set (male speakers only) for different VTN scenarios. Bigram lm, within-word models, no adaptation in recognition, partitioned evaluation.
Using VTN in an unpartitioned evaluation poses additional problems. At least for the segmentation we used, the average length of a segment is larger in the partitioned evaluation of the development set (13 seconds) than in the unpartitioned evaluation of the evaluation data (6.5 seconds for eval'97). We observed that the estimate of the warping factor became less reliable the shorter the segment on which it was estimated. We therefore decided to do no frequency warping for segments with fewer than a certain minimum number of frames available to estimate the warping factor. Table 6 presents recognition results for a minimum of 100 frames. Due to this threshold we did no frequency warping for 9% of the segments. This was of course not an ideal solution, since leaving the recognition data unnormalized is unfavorable if the training data have been normalized. Currently we are trying to apply VTN on a per-segment-cluster level rather than on a per-segment level.
Table 6: Effects of VTN on the word error rate for gender-dependent (GD) and gender-independent (GI) models. Hub-4 eval'96 test set, bigram lm, within-word models, unpartitioned evaluation, NIST'96 scoring rules.
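The fallback for short segments in the unpartitioned evaluation reduces to a simple guard. In this sketch the 100-frame threshold is the one reported in the text, while the function name and the `estimate_factor` callable (standing in for the maximum-likelihood warp estimation) are hypothetical.

```python
MIN_FRAMES = 100  # minimum number of frames reported in the text


def warp_factor_for_segment(num_frames, estimate_factor, min_frames=MIN_FRAMES):
    """Return a warping factor, or 1.0 (no warping) for short segments.

    estimate_factor: hypothetical zero-argument callable that runs the
    warp estimation for this segment; it is only called when the segment
    has at least `min_frames` frames.
    """
    if num_frames < min_frames:
        return 1.0
    return estimate_factor()
```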
MLLR unsupervised adaptation of the mean vectors is applied on clusters of segments using the Least Mean Squares approximation [10]. For information on the clustering procedure, see [11]. The regression classes are based on phonetic knowledge and are dynamically defined using a tree organisation. The amount of adaptation speech determines both the number of active regression classes and the structure of the MLLR transformation matrices. In light of the presumably high error rate we adopted a conservative approach and used more than one MLLR transformation matrix only for clusters with more than 10000 frames. We used a single block-diagonal or purely diagonal matrix if the number of observations was below 1000 and 200, respectively.
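The data-dependent choice of transform structure can be summarized as a small decision rule. The thresholds are those quoted above. The sketch assumes that "frames" and "observations" count the same thing, and that clusters between 1000 and 10000 frames receive a single full matrix; the text states neither explicitly. The function name and return labels are illustrative.

```python
def choose_mllr_structure(num_frames):
    """Pick an MLLR transform structure from the amount of adaptation data.

    Thresholds (200, 1000, 10000) are taken from the text. The "single full"
    case for the middle range is an assumption.
    """
    if num_frames < 200:
        return "single diagonal"
    if num_frames < 1000:
        return "single block-diagonal"
    if num_frames <= 10000:
        return "single full"
    return "multiple (regression class tree)"
```

For example, a cluster with 500 frames would receive one block-diagonal matrix under this rule, while only clusters above 10000 frames would activate more than one regression class.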
Note that MLLR adaptation was applied to both the within-word model set and the cross-word model set. Table 7 presents the results for adaptation of the mean vectors of the within-word models. It can be seen that the error rate improvement due to VTN and MLLR was about 8% on the eval'97 data.
Table 7: Word error rates on eval'97 for bigram lm, gender-independent setup, within-word models. NIST'97 scoring rules.
By applying channel (mean and variance normalization) and speaker (vocal tract normalization) normalization techniques, as well as speaker adaptation (MLLR), focus-, gender- or bandwidth-specific acoustic modeling was avoided. We achieved our eval'97 results with only two model sets, one for within-word and one for cross-word decoding.