Most previous speech recognition research deals with segmented speech, such as the WSJ corpus, where the speaker and condition are constant over a session. A cluster for any particular speaker can be generated by merging all the segments from the same speaker. However, for many real-world continuous speech recognition problems, this speaker information is usually not available, nor are the boundaries of the speech. For continuous speech recognition such as the Hub4 evaluation, we need to segment half-hour audio programs and cluster the automatically segmented speech into speaker clusters. As an important component of a recognition system, a good speaker clustering procedure can improve the performance of continuous speech recognition by supporting unsupervised adaptation. We have observed 10-25% relative WER reduction from unsupervised adaptation on a variety of tasks. Without the support of good clusters, the reduction could be smaller.
The goal of speaker clustering is to classify segmented speech into clusters such that each cluster contains speech from only one speaker and all speech from the same speaker is classified into the same cluster. In practice, we regard speaker as a generic concept which really means speaker together with channel and background condition. Thus, speech from the same physical speaker with significantly different channel and/or background conditions should be treated as speech from two different speakers in speaker clustering. On the other hand, we may want to classify speech from two speakers into the same cluster if their acoustic characteristics are not significantly different. In any case, the ultimate effectiveness of speaker clustering will be measured by how well the clusters do in adaptation.
We developed and implemented a speaker clustering algorithm which automatically determines all parameters based on a penalized model selection criterion. Our algorithm takes advantage of the fact that consecutive segments are more likely to come from the same speaker, but does not assume any prior knowledge about the speakers and their speech. We also introduce a penalty against too many clusters, so as to avoid the unwanted solution of one segment per cluster. Both parameters, i.e. the measure of how likely consecutive segments are to come from the same speaker and the total number of clusters, are data driven, and the algorithm is fully automatic. Experiments show that this automatic speaker clustering algorithm improves unsupervised adaptation almost as much as the hand labeled ideal case, where the clusters are generated based on the true speaker, channel and background condition.
The 1996 Hub4 evaluation includes both the partitioned evaluation (PE) and the unpartitioned evaluation (UE) tests. In the PE test, the data was already partitioned into segments having constant speaker/channel/background conditions, and each segment was given a feature label denoting these conditions. In the UE test, the speech needs to be segmented and the feature labels are not available. We believe the UE test is the real-world problem and chose not to use the segmental feature labels in the PE test, in order to focus on approaches that would be viable for the general case. The same speaker clustering procedure was therefore used in both the PE and UE Hub4 evaluations. In our Hub4 PE system, there is a procedure that chops the original segments into shorter ones so that the BYBLOS decoder [NGUYEN97] can handle them more efficiently. In experiments on the development data, our speaker clustering algorithm seldom assigned chopped segments from the same original segment to different clusters, even though it ignores the segmental labels in the PE test. This suggests that our speaker clustering procedure is reliable and should work well in the UE test too. In fact, our PE and UE systems are almost identical except for the segmentation and gender detection procedures. The evaluation results show that the total degradation of our UE test relative to the PE test is only about 5%.
In the next section, we describe the details of the speaker clustering algorithm. In section 3, some experimental results are provided to show the effectiveness of this algorithm. Finally, in section 4, we discuss alternative model selection criteria and the potential application of speaker clustering in speaker adapted training (SAT) [TASOS96] [MATSOUKAS97].
Consider that we have a collection of segments $S = \{s_1, s_2, \ldots, s_n\}$, and each $s_i$ represents a sequence of spectral feature vectors, i.e. the cepstral vectors in our implementation. Speaker clustering means to find a partition $P = \{c_1, c_2, \ldots, c_k\}$ of $S$ such that each $c_j$ contains only segments from the same speaker/condition and also speech segments from this speaker are classified into $c_j$ only. Assume that the vectors in each of these sequences can be modeled as coming from a multivariate Gaussian distribution and that the vectors are statistically independent. A good clustering solution should have relatively small dispersion within clusters. The within-cluster dispersion [Wilks] is defined as

$$W = \sum_{j=1}^{k} N_j \Sigma_j$$

where $\Sigma_j$ is the covariance matrix and $N_j$ is the total number of feature vectors in cluster $c_j$.
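As an illustration of the within-cluster dispersion defined above, the following is a minimal sketch in Python with NumPy. It assumes that each segment is an array of feature vectors with one row per frame and that the covariance of a cluster is the maximum-likelihood estimate (divided by the frame count), so that the count times the covariance is simply the scatter matrix of the cluster. The function names `within_cluster_dispersion` and `log_det` are hypothetical and are not taken from the system described here.

```python
import numpy as np

def within_cluster_dispersion(segments, labels):
    """Return W, the sum over clusters of (frame count) x (cluster covariance).

    segments: list of arrays, each of shape (num_frames, dim)
    labels:   list of cluster ids, one per segment (assumed input format)
    """
    dim = segments[0].shape[1]
    W = np.zeros((dim, dim))
    for cluster_id in set(labels):
        # Pool all frames of the segments assigned to this cluster.
        frames = np.vstack([s for s, l in zip(segments, labels) if l == cluster_id])
        centered = frames - frames.mean(axis=0)
        # ASSUMPTION: ML covariance, so count x covariance equals the scatter matrix.
        W += centered.T @ centered
    return W

def log_det(W):
    """Log-determinant of W, which is numerically safer than the determinant itself."""
    return np.linalg.slogdet(W)[1]
```

Working with the log-determinant rather than the determinant is a design choice of this sketch only: it preserves the ordering of candidate partitions while avoiding overflow when the feature dimension is large.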
There are several good clustering criteria [EVERITT]. We prefer to use the determinant of W to measure the goodness of speaker clustering. That is, the best clustering solution is obtained by minimizing this measure over the parameter space. In practice, however, this will usually lead to the unwanted clustering solution of one segment per cluster. A penalty against too many clusters helps avoid such unwanted solutions. Thus, the best clustering solution is obtained by minimizing the penalized measure instead.
There are three components in the implementation of the algorithm,
Under this criterion, the partition that minimizes the determinant of W is the best clustering solution. This criterion is one of the most widely favored. In practice, however, it will almost surely end up with the unwanted solution of one segment per cluster, because the determinant measure is non-increasing as the number of clusters increases. One approach to avoid this is to introduce a penalty against having too many clusters in the partition. So instead, we use a penalized criterion, which chooses the partition that minimizes the determinant measure combined with such a penalty.
Figure 1 illustrates how the penalty helps to avoid the unwanted clustering solutions during model selection over the parameter space.
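The effect of the penalty on model selection can be sketched in Python as follows. This is only an illustration under several assumptions that are not taken from the text: candidate partitions are produced by greedy bottom-up merging, the criterion is evaluated in the log domain, and the penalty is a hypothetical term of 0.5 log k (a multiplicative factor of the square root of the number of clusters k). The sketch also ignores the preference for grouping consecutive segments. All function names are hypothetical.

```python
import itertools
import numpy as np

def log_det_w(segments, clusters):
    """log-determinant of the within-cluster dispersion for a partition.

    clusters is a list of lists of segment indices (assumed representation).
    """
    dim = segments[0].shape[1]
    W = np.zeros((dim, dim))
    for members in clusters:
        frames = np.vstack([segments[i] for i in members])
        centered = frames - frames.mean(axis=0)
        W += centered.T @ centered  # frame count x ML covariance of the cluster
    return np.linalg.slogdet(W)[1]

def penalty(k):
    # ASSUMPTION: illustrative penalty only; the form of the penalty is not given here.
    return 0.5 * np.log(k)

def merge(clusters, a, b):
    """Return a new partition in which clusters a and b are merged."""
    rest = [c for n, c in enumerate(clusters) if n not in (a, b)]
    return rest + [clusters[a] + clusters[b]]

def select_partition(segments):
    """Greedy bottom-up merging; keep the partition with the lowest penalized score."""
    clusters = [[i] for i in range(len(segments))]  # start with one segment per cluster
    best_score, best = np.inf, None
    while True:
        score = log_det_w(segments, clusters) + penalty(len(clusters))
        if score < best_score:
            best_score, best = score, [list(c) for c in clusters]
        if len(clusters) == 1:
            return best
        # Merge the pair of clusters whose union yields the smallest dispersion.
        a, b = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ab: log_det_w(segments, merge(clusters, *ab)))
        clusters = merge(clusters, a, b)
```

Without the `penalty` term, the score at the starting partition of one segment per cluster is never beaten, because merging can only increase the dispersion. With the penalty, merging acoustically similar segments costs little in dispersion and lowers the penalty, so the selected partition has fewer clusters.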
The automatic clustering algorithm generated 25 clusters for the episode mentioned in table 1. Only 4 of them had segments with mixed speech conditions. Telephone/bandlimited speech was never mixed with anything else. This indicates that the blind clustering algorithm distinguishes speech conditions quite well, especially for telephone/bandlimited speech.
For the PE development data, the segmentation was made manually, with each segment having a feature label that is constant in speaker, channel and background. Hand segmentation tends to make the segments as long as possible. This should help the WER reduction of segment based adaptation in the PE, because each segment based cluster can actually contain several chopped segments from the same PE segment. In the UE, however, the segmentation is done by the automatic segmentation procedure on the half-hour audio programs, so segment based adaptation actually means adaptation on clusters with only one chopped segment in each. Chopped segments are usually very short, i.e. less than 10 seconds or about 20 words. With less than 20 seconds of speech, unsupervised adaptation may not be robust enough. It is therefore expected that speaker clustering should help more, relative to segment based adaptation, in the UE than in the PE.
We applied this speaker clustering technique in the 1996 Hub4 evaluation. Recently we ran experiments to assess the performance of the automatic speaker clustering algorithm. The results, in table 2, show that the automatic algorithm did only 0.7% relative worse than the hand labeled ideal case, where the relative WER reduction from unsupervised adaptation was about 6.3%.

Although the automatic speaker clustering algorithm improves adaptation almost as much as the hand labeled ideal clustering based on speaker, channel and background condition, the model selection tends to find fewer clusters than the true number, as indicated in table 3.

We found that only seven original segments had their chopped segments not clustered together. The algorithm favors putting speakers together over splitting speech from the same speaker into different clusters. Since some speakers in the episodes had very little speech, merging speakers into the same cluster might actually help reduce the word error rate if their acoustic characteristics are not significantly different. It turns out that putting speakers together does not hurt.
or
for some constant C.
Speaker clustering could also be used in speaker adapted training (SAT), similar to Padmanabhan's approach [PADMANABHAN]. The training data of our Hub4 models includes speech from almost 2400 speakers, and most of them have less than 20 seconds of speech in total. Too little speech per speaker could make the transformation matrix estimates in SAT training unreliable. Since speakers with the same condition labels can differ a lot acoustically, it might make sense to cluster these speakers with the automatic clustering algorithm and use the clusters as generic speakers in training.
2. Anastasakos, T., J. McDonough, R. Schwartz, "A Compact Model for Speaker-Adaptive Training," Proceedings of ICSLP-96, Philadelphia, PA, October 1996.
3. Matsoukas, S., R. Schwartz, H. Jin, L. Nguyen, "Practical Implementations of Speaker-Adaptive Training," in these Proceedings.
4. Nguyen, L., R. Schwartz, "Efficient 2-Pass Nbest Decoder," in these Proceedings.
5. Gish, H., et al., "Segregation of Speakers for Speech Recognition and Speaker Identification," IEEE International Conference on Acoustics, Speech, & Signal Processing Conference Proceedings, 1991.
6. Everitt, B. Cluster Analysis, Halsted Press, New York, 1980, pp. 24-35.
7. Wilks, S., Mathematical Statistics, Wiley and Sons, New York, 1962.
8. Padmanabhan, M., et al., "Speaker Clustering and Transformation for Speaker Adaptation in Large-Vocabulary Speech Recognition Systems," IEEE International Conference on Acoustics, Speech, & Signal Processing Conference Proceedings, 1995, pp. 701-704.