Statistical Engineering Division, ITL
Michael Chernick, Kevin Mills, Robert Toense
Advanced Network Technologies Division, ITL
Modern communication channels, such as digital cellular telephony, convey human speech in a highly encoded form. Improvements in the quality of the communications require objective measurements of that quality. This work was designed to be a first step in exploring the feasibility and applicability of using automated speech recognition technology to model human perception of communication channel quality.
Segments of speech from a widely accepted speech database were selected and passed through a speech recognizer under three conditions: (1) without encoding, (2) with encoding and decoding using a standard algorithm for speech compression, and (3) with encoding, transmission across a noisy channel, and then decoding. Speech recognition scores were computed for each speech segment under each of the three conditions. Human listeners were then asked to subjectively evaluate the intelligibility of a subset of the speech segments under the same conditions.
Of primary interest is the correlation between the intelligibility of speech as evaluated by the automatic recognizer and by the human listeners. For speech segments used to train the automated recognizer, the correlation was 0.816 ± 0.064 (2 standard deviations). For other speech segments, the correlation was 0.745 ± 0.074 (2 standard deviations). (The Spearman rank correlations were 0.789 and 0.775, respectively.) These results are encouraging enough to warrant further investigation of how commercial speech recognizers perform against human listeners on a more objective basis. For example, one might envision scoring speech recognizers and human listeners on identical speech-to-text transcription tasks, and then computing the correlation in performance. Alternatively, one might consider constructing an automated evaluation system based on speech-to-text transcription.
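As an illustration of the two correlation measures reported above, the following is a minimal sketch in Python of how a product-moment correlation and a Spearman rank correlation might be computed from paired scores. It is not the code used in this study. The function names and the sample scores are hypothetical and chosen only for demonstration, and the sketch does not attempt to reproduce the reported uncertainty bounds, since the summary does not state how they were derived.

```python
import math


def pearson(x, y):
    """Product-moment (Pearson) correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired scores for six speech segments (NOT data from the study):
# one automatic recognition score and one mean listener rating per segment.
recognizer_scores = [0.92, 0.85, 0.60, 0.75, 0.40, 0.55]
listener_ratings = [4.5, 4.4, 3.0, 3.9, 2.1, 2.5]

print("Pearson: ", round(pearson(recognizer_scores, listener_ratings), 3))
print("Spearman:", round(spearman(recognizer_scores, listener_ratings), 3))
```

The rank-based measure depends only on the ordering of the scores, so it is less sensitive than the product-moment correlation to the particular scale on which listeners rate intelligibility.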
Figure 5: Correlation between speech recognizer and human listeners for trained speakers.
Date created: 7/20/2001