This paper also describes some experiments that we tried but did not use in the official evaluation, including experiments with perplexity minimization, Maximum Entropy modeling, and parsing.
The purpose of the denominator in this formula is to compensate for the fact that high-frequency words are more likely to appear in the cache than low-frequency words. For instance, the occurrence of a high-frequency word like ``Clinton'' supplies much less new information to the language model than does the appearance of a relatively low-frequency word such as ``Schwarzenegger''.
Although our experiments have shown that CScore(w) is useful, it doesn't make use of the fact that some words tend to be highly concentrated in a few articles whereas other words are likely to be spread fairly evenly over the corpus. Consider the following two medium-frequency words:
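| Word | F(w) | Docs with exactly 1 occurrence | Docs with 2+ occurrences | Expected reoccurrences given 1 appearance | Expected reoccurrences given 2+ appearances |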
| second | 51707 | 24273 | 9775 | 0.5187 | 0.8065 |
| japan | 50066 | 8103 | 8076 | 2.0945 | 3.1960 |
Definitions:
The two quantities which we used for our weighted cache prediction were the expected number of reoccurrences of w given one appearance of w in the document and the expected number given 2+ appearances. These quantities are similar to those described in [10]. Note that we are using the same formula for 2+ appearances of w as for 2 appearances. We did this due to an intuition that further appearances probably contained no new information and also for simplicity of implementation.
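As a rough illustration, the two expected-reoccurrence values in the table can be reproduced from aggregate counts. This is only a sketch under an assumption that is not stated in the text: that the third and fourth columns of the table are the number of documents containing w exactly once and the number containing it two or more times. The function and argument names are ours.

```python
def expected_reoccurrences(total_freq, docs_exactly_one, docs_two_plus):
    """Return (reoccurrences given 1 appearance, reoccurrences given 2+ appearances).

    Assumed inputs: total corpus frequency F(w), number of documents in which
    w occurs exactly once, and number of documents in which it occurs 2+ times.
    """
    docs_with_word = docs_exactly_one + docs_two_plus
    # Occurrences beyond the first, averaged over all documents containing w.
    after_one = (total_freq - docs_with_word) / docs_with_word
    # Occurrences beyond the second, averaged over documents with 2+ occurrences.
    occurrences_in_two_plus = total_freq - docs_exactly_one
    after_two = (occurrences_in_two_plus - 2 * docs_two_plus) / docs_two_plus
    return after_one, after_two


table_rows = {"second": (51707, 24273, 9775), "japan": (50066, 8103, 8076)}
for word, counts in table_rows.items():
    after_one, after_two = expected_reoccurrences(*counts)
    print(f"{word}: {after_one:.4f} {after_two:.4f}")
```

Under this reading, the counts for ``second'' and ``japan'' give back the values listed in the last two columns of the table, which is why we adopt it here.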
The table suggests that these quantities model a real phenomenon: their values for ``japan'' are much higher than for ``second'', even though the two words have about the same unigram frequency. This matches our intuition that ``japan'' is much more likely to be the topic of an article than is ``second''.
We combined these quantities together using the following formula:
In combining these three knowledge sources (sublanguage, cache, and weighted cache), we found from experiments on minimizing the error rate of the devtest data that we achieved our best results by using all three sources (see Table 1).
The absolute improvement is small; however, there is a limit to the improvement we can obtain, because the N-best sentences don't always contain the correct candidate. It is important to see the difference between the number of errors produced by the base system and the minimum number of errors obtainable by choosing, for each sentence, the N-best hypothesis with the fewest errors. (We will call the latter error rate ``MNE'' for ``minimal N-best errors''.) Although we don't have the precise MNE figure for the 1996 evaluation, our estimate from the dev data suggests that our achievement is about 5% of the possible improvement indicated by the MNE. We believe that this result is satisfactory, because many word errors are unrelated to the article topic, for example function word replacement (``a'' replaced by ``the'') or the deletion or insertion of topic-unrelated words (a missing ``over'').
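The MNE can be computed directly when reference transcripts are available. The following is a minimal sketch, not the scoring tool used in the evaluation: it counts word errors with a plain word-level edit distance, and the toy sentences, function names, and the `share_of_possible_gain` reading of the 5% figure are our own assumptions.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1]


def base_and_oracle_errors(test_set):
    """test_set: list of (reference, n_best); n_best[0] is the base system's top choice."""
    base = sum(word_errors(ref, n_best[0]) for ref, n_best in test_set)
    mne = sum(min(word_errors(ref, hyp) for hyp in n_best) for ref, n_best in test_set)
    return base, mne


def share_of_possible_gain(base, rescored, mne):
    """One plausible reading of the text: gain achieved relative to the gain available."""
    return (base - rescored) / (base - mne)


# Toy N-best lists, invented for illustration only.
toy_set = [
    ("the problem isn't gridlock",
     ["the problem is in gridlock", "the problem isn't gridlock"]),
    ("some dealers lowered their prices",
     ["some dealers lowered the prices", "some dealer lowered a prices"]),
]
base, mne = base_and_oracle_errors(toy_set)
print(base, mne)
```

In the second toy sentence no hypothesis is fully correct, so the oracle still incurs an error; this is the sense in which the N-best list caps the achievable improvement.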
Furthermore, the improvements were achieved with different corpora (WSJ, NAB, and BN) and different speech systems (BBN, SRI), as shown in Figure 2. This is also encouraging, because it demonstrates that the sublanguage technique can indeed work in such different environments.
| Site (Year) | Description | Result | Ref. |
| IBM (91) | cache model | | [2] |
| CMU (94) | trigger model | 19.9->17.8 | [3] |
| BU (93-94) | clustering (4 topic LM) | 11.3->11.2 | [4] |
| NYU (94-96) | sublanguage, cache and weighted cache model | 11.0->10.6 | [5] |
| | | 24.6->24.0 | [6] |
| | | 33.3->33.0 | |
| CMU (96) | hand clustering (5883 topic) | 0.1, 0.6% improvement in 2 stories | [7] |
| SRI (96) | clustering (4 topic LM) | 33.1->33.0 | [8] |
| CU (96) | cache model | 27.7->27.5 | [9] |
To get around this problem, we reformulated our cache and sublanguage models to produce probabilities rather than scores and ran some preliminary perplexity minimization experiments on a single day of WSJ data, which represented a 73,000 word corpus vs the 8,000 words which we had in the '95 devtest data. These experiments showed that the interpolation of cache and sublanguage probabilities with the baseline trigram probabilities caused a big decrease in perplexity, but we got no improvement in error rate when the weights which minimized perplexity were used on the devtest data.
Surprised by this result, we reran our perplexity experiments on the devtest text data and got perplexity and word error figures for over 200 different relative weightings of trigram, cache, and sublanguage values. A representative slice of this three-dimensional grid can be seen in Figure 3, which shows perplexity and word error rates for various weightings of the sublanguage/cache component relative to a standard backoff trigram component [16]. Note that in the chart the sublanguage/cache ratio is fixed at 4:6 and that the point ``0.00'' represents a purely trigram model.
| weight for SL/cache | perplexity | Word Errors |
| 0.00 | 151.9 | 720 |
| 0.03 | 131.2 | 716 |
| 0.06 | 128.4 | 715 |
| 0.09 | 127.8 | 719 |
| 0.12 | 128.2 | 721 |
| 0.15 | 129.3 | 724 |
| 0.18 | 130.9 | 727 |
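The weighting explored in the table can be sketched as a linear interpolation followed by a perplexity computation. This is an illustration under assumptions, not our experimental code: the per-word probabilities below are invented, and the function names are hypothetical. Only the 4:6 sublanguage/cache ratio and the meaning of the weight (0.00 is a pure trigram model) come from the text.

```python
import math


def interpolate(p_trigram, p_sublang, p_cache, weight, sl_share=0.4):
    """Mix the trigram probability with a 4:6 sublanguage/cache blend.

    weight = 0.0 gives the pure trigram model.
    """
    p_aux = sl_share * p_sublang + (1.0 - sl_share) * p_cache
    return (1.0 - weight) * p_trigram + weight * p_aux


def perplexity(word_probs):
    """exp of the average negative log probability per word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


# Invented (trigram, sublanguage, cache) probabilities for a four-word text.
toy_text = [
    (0.010, 0.020, 0.050),
    (0.200, 0.100, 0.000),
    (0.001, 0.010, 0.030),
    (0.050, 0.040, 0.000),
]
for weight in (0.00, 0.06, 0.12):
    probs = [interpolate(t, s, c, weight) for t, s, c in toy_text]
    print(f"weight={weight:.2f} perplexity={perplexity(probs):.1f}")
```

A sweep of this kind only measures perplexity; as the table shows, the weight that minimizes perplexity need not be the one that minimizes word errors.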
The reader may have noticed that the results achieved by the probabilistic approach were significantly worse than those produced by the original ``scoring-based'' model. This is probably due to various features which were left out of the experimental probability-based model. For instance, the scoring-based model used the entire remainder of the document as context for determining a sentence's cache and sublanguage scores whereas the probabilistic model just used the preceding sentences in the document. It also seemed possible that the scoring-based formulae might be working better than the probabilistic formulae. Another possible explanation is that seeking to optimize the error rate by searching for a linear combination of log probabilities (or scores) may be better than doing the same with a combination of ``unlogged'' probabilities, as we did in the probabilistic model. We mention these differences just to point out that the probabilistic model cannot be directly compared with the scoring model. Since our results indicated that our probabilistic approach was not helpful, we ended up using the scoring-based model in the evaluation.
Maximum entropy modeling (M.E.) offers some of the same benefits as the perplexity minimization method in that it allows us to train on large text corpora rather than on the smaller amount of n-best data for which we have acoustic data available. More importantly, though, M.E. gives us a new way of constructing these models and of combining them in a non-linear fashion.
Consider, for instance, the cache scoring formula of equation 5. This formula was developed according to our intuition of the nature of cache word repetition, but it is vulnerable to criticism on other intuitive grounds. For instance, does it make sense that the (unlogged) score for a word should double when a word has been seen twice as many times in an article? Furthermore, our team has had continuing internal debates about how to handle the interaction of the cache and sublanguage models: i.e. should the sublanguage model predict cache words or should it leave the prediction of those words entirely to the cache component?
M.E. theory [13] [14] offers an intuitively and theoretically satisfying answer to these sorts of questions which vex language modelers. When using M.E. for language modeling, one identifies a set of linguistically significant ``constraints'' and a training corpus and then the M.E. algorithm builds a model which is guaranteed to:
For our experiments, we used a set of constraints which closely mirrored the phenomena we were trying to capture in our previous cache and sublanguage modeling experiments:
The M.E. algorithm will build a model in which the conditional probabilities of these features will conform, on average, with those found in the training corpus. This can be expressed more precisely in the following way, using one feature family as an example. First define an indicator function of the document history h and the current word w:
Now we observe in the training corpus that
We then constrain our M.E. language model to only consider conditional models, P(w|h), which conform to this constraint:
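In the standard M.E. formulation, a constraint of this kind requires the model's expected value of the indicator function to equal its average value in the training corpus. The sketch below checks that condition for a candidate model. The `seen_before` feature is a hypothetical cache-style indicator chosen for illustration and is not necessarily one of the feature families we used; the toy documents and the uniform model are likewise assumptions.

```python
def seen_before(history, word):
    """Hypothetical indicator: 1 if `word` already occurred in the document history."""
    return 1.0 if word in history else 0.0


def empirical_expectation(documents, feature):
    """Average feature value over every (history, word) event in the corpus."""
    total, events = 0.0, 0
    for doc in documents:
        for i, word in enumerate(doc):
            total += feature(doc[:i], word)
            events += 1
    return total / events


def model_expectation(documents, feature, model, vocab):
    """Average over observed histories h of sum_w P(w|h) * f(h, w)."""
    total, events = 0.0, 0
    for doc in documents:
        for i in range(len(doc)):
            history = doc[:i]
            total += sum(model(w, history) * feature(history, w) for w in vocab)
            events += 1
    return total / events


docs = [["japan", "said", "japan"], ["the", "second", "the", "second"]]
vocab = sorted({w for doc in docs for w in doc})


def uniform(word, history):
    """Toy conditional model P(w|h) that ignores the history."""
    return 1.0 / len(vocab)


target = empirical_expectation(docs, seen_before)
actual = model_expectation(docs, seen_before, uniform, vocab)
print(f"corpus: {target:.4f}  uniform model: {actual:.4f}")
```

The uniform model underestimates how often words repeat, so it violates the constraint; M.E. training adjusts the model parameters until the two expectations agree for every feature.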
One departure of this work from other work in the field [14] is that we build a very small, and hence computationally tractable, model using only c. 200 constraints/parameters. Rosenfeld, by contrast, built a model which had c. 2.2 million parameters. The primary reason for the difference is that we are leaving n-gram constraints out of the model, whereas Rosenfeld incorporated them into his. We think that we may be paying a penalty in performance by doing this, but we hope that we will nevertheless squeeze significant benefits out of the model while avoiding the very heavy computational requirements which Rosenfeld reported: roughly two weeks of machine time on 15 DEC/Alpha workstations.
Our preliminary results with these experiments were that we achieved only 31% of the gain which we achieved by using the conventional methods (see Figure 1). While these results might seem to be discouraging, we believe that they are due, in part, to the fact that we have not yet had sufficient time to experiment with the techniques.
Since we implemented this using a publicly available M.E. toolkit [15] which permits basically any knowledge source to be used as input so long as it can be parameterized along the lines shown above, we think that we may have the ability to integrate a large number of different linguistic sources into a single, unified model. Among the sources which we are thinking of integrating are:
Recently, we implemented a technique to incorporate the probabilities of lexical dependencies into the parser. We created a simple set of rules to identify the head of each constituent, and assigned dependency relationships between the head and all the other elements. This relationship is actually a long distance, syntactically motivated bigram (for example, between a verb and the head of its subject). In some cases, this dependency bigram can work better than the usual bigram, because the relationship is syntactically meaningful, and not just between consecutive words. However, the only currently available large syntactically tagged corpus is the University of Pennsylvania Tree Bank; we used the Wall Street Journal portion of the Tree Bank to acquire the lexical dependency probabilities. One of the serious and unavoidable problems is the limited size of the training corpus. Compared to the corpus size typically used for bigram training, the training size for the dependency relationships is significantly smaller. One idea for tackling this problem in the future is to use the parser in order to create a relatively reliable tagged corpus. We have found that the approach using the dependency relationships produces good performance for analyzing written text. The typical accuracy measurement (recall and precision of bracketing) improves about 2% compared to the parsing result without dependency relationships.
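The head-finding step and the resulting dependency bigrams can be sketched as follows. The head rules, the tree encoding, and the example sentence are all hypothetical and far simpler than the rules used with the parser; the sketch only shows how each constituent's head is paired with the heads of its sibling elements.

```python
# Hypothetical head rules: for each constituent label, child labels to try in order.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNS"]}


def head_and_dependencies(tree, deps):
    """Return the head word of `tree`, appending (head, dependent) pairs to `deps`.

    A leaf is (pos_tag, word); an internal node is (label, [children]).
    """
    label, body = tree
    if isinstance(body, str):
        return body
    child_heads = [head_and_dependencies(child, deps) for child in body]
    child_labels = [child[0] for child in body]
    head_index = len(body) - 1  # fallback: rightmost child
    for wanted in HEAD_RULES.get(label, []):
        if wanted in child_labels:
            head_index = child_labels.index(wanted)
            break
    head = child_heads[head_index]
    for i, dependent in enumerate(child_heads):
        if i != head_index:
            deps.append((head, dependent))
    return head


tree = ("S", [
    ("NP", [("NNS", "dealers")]),
    ("VP", [("VBD", "lowered"),
            ("NP", [("PRP$", "their"), ("NNS", "prices")])]),
])
dependencies = []
root_head = head_and_dependencies(tree, dependencies)
print(root_head, dependencies)
```

The pair linking ``lowered'' to ``dealers'' is the kind of verb/subject-head relationship mentioned above; in a longer sentence the two words need not be adjacent, which is what distinguishes it from an ordinary bigram.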
Because the domain of the training corpus is business newspaper articles, we decided that we would initially try the parsing scheme on the 1995 speech evaluation data from the North American Business News domain rather than on the 1996 (Broadcast News domain) evaluation data.
| | Trigram favors correct sent. | Trigram favors SRI-best |
| Parser favors correct | 16 | 23 |
| Parser favors SRI-best | 9 | 17 |
By looking at the 23 instances in the top-right category, where the parser predicted correctly while the trigram model did not, we find a number of encouraging examples. Six examples are listed in the Appendix. For example, in the first sentence, mcdonnell ..., SRI's best candidate has no verb, yet the trigram score for the candidate is better than for the correct sentence. In the second sentence, they say..., there are too many verbs in SRI's best candidate. This is exactly what we expected to achieve with a parser. In other words, sometimes wide context is more important for picking the correct words than local (trigram) context.
The other categories (16 top-left and 17 bottom-right in the table) are harmless; adding the parsing score to the trigram score in these cases does not affect the ranking of the two sentences. Many such cases are to be expected because syntactic context often includes local evidence.
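The four cells of the table can be filled mechanically from per-sentence scores. The sketch below does this for four of the Appendix entries. It rests on two assumptions inferred from the Appendix headings rather than stated in the text: that each pair of numbers is (parser score, trigram score), and that a lower score is better. The function name is ours.

```python
from collections import Counter


def favored(correct_score, sri_score):
    """Which hypothesis a model prefers, assuming a lower score is better."""
    return "correct" if correct_score < sri_score else "SRI-best"


# ((parser, trigram) for the correct sentence, (parser, trigram) for SRI-best),
# copied from four Appendix entries; the score order is an assumption.
pairs = [
    ((448, 655), (424, 614)),
    ((1328, 1306), (1254, 1333)),
    ((1360, 1548), (1404, 1491)),
    ((1132, 1273), (1210, 1202)),
]

cells = Counter()
for (c_parser, c_trigram), (s_parser, s_trigram) in pairs:
    cells[(favored(c_parser, s_parser), favored(c_trigram, s_trigram))] += 1

for (parser_choice, trigram_choice), count in sorted(cells.items()):
    print(f"parser favors {parser_choice}, trigram favors {trigram_choice}: {count}")
```

Each key is a (parser choice, trigram choice) pair, so the key ("correct", "SRI-best") corresponds to the top-right cell of the table.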
Outside of this table, we found an interesting example. It concerns out-of-vocabulary words (in particular, proper nouns) and is shown in the Appendix under the ``Others'' category. The sentence contains an OOV sequence of long proper nouns (``noriyuki matsushima''), and because these nouns are not in the vocabulary, the speech system produced an unusual sequence of words (``nora you keep matsui shima''). We could not calculate a trigram score for the correct hypothesis, but as you can imagine the parser assigned a much better score to the correct sentence. So, it may be interesting for future work to use the technique of parsing in order to try to identify these mistakes on out-of-vocabulary words.
We found some suggestive evidence that the parser may be able to help, although it is not yet at the point of improving recognition accuracy. As it seems promising, it is worth pushing this line of research. This will include improving the parser and also adapting the parser to the recognition task. In particular, because the output style of the speech recognizer is not the same as the written text, we should make some adjustments to the grammar and dictionary. For example, the recognizer output does not have commas or quotation marks, which are significant clues in written text parsing, so the grammar needs to be adjusted accordingly.
2. F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss ``A Dynamic Language Model for Speech Recognition'' Proceedings of DARPA Speech and Natural Language Workshop (1991)
3. Ronald Rosenfeld ``Adaptive Statistical Language Modeling'' Proceedings of Human Language Technology Workshop (1994)
4. M.Ostendorf, F.Richardson, R.Iyer, A.Kannan, O.Ronen and R.Bates ``The 1994 BU NAB News Benchmark System'' Proceedings of the ARPA Spoken Language Systems Technology Workshop (1995)
5. Satoshi Sekine, John Sterling and Ralph Grishman ``NYU/BBN 1994 CSR evaluation'' Proceedings of the ARPA Spoken Language Systems Technology Workshop (1995)
6. Satoshi Sekine and Ralph Grishman ``NYU Language Modeling Experiments for the 1995 CSR Evaluation'' Proceedings of the DARPA Speech Recognition Workshop (1996)
7. Kristie Seymore, Stanley Chen, Maxine Eskenazi and Roni Rosenfeld ``Language and Pronunciation Modeling in the CMU 1996 Hub 4 Evaluation'' Proceedings of the DARPA Speech Recognition Workshop (1997)
8. Fuliang Weng, Andreas Stolcke and Ananth Sankar ``Hub-4 Language Modeling using Domain Interpolation and Data Clustering'' Proceedings of the DARPA Speech Recognition Workshop (1997)
9. Steve Young, Mark Gales, David Pye and Phil Woodland ``HTK Broadcast News Language Model'' Proceedings of the DARPA Speech Recognition Workshop (1997)
10. Slava M. Katz ``Distribution of content words and phrases in text and language modeling'' Natural Language Engineering, Vol.2 Part.1, pp15-60 (1996)
11. Satoshi Sekine, Ralph Grishman ``A Corpus-based Probabilistic Grammar with Only Two Non-terminals'' Proceedings of the Fourth International Workshop on Parsing Technologies (1995)
12. Satoshi Sekine ``Apple Pie Parser: home page'' http://cs.nyu.edu/cs/projects/proteus/app (1996)
13. Edwin T. Jaynes ``Information Theory and Statistical Mechanics'' Physical Review 106, pp620-630, (1957)
14. Ronald Rosenfeld ``Adaptive Statistical Language Modeling: A Maximum Entropy Approach'' CMU Technical Report CMU-CS-94-138 (1994)
15. Eric Sven Ristad ``Maximum Entropy Modeling Toolkit, release 1.5 Beta'' ftp://ftp.cs.princeton.edu/pub/packages/memt (1997)
16. Slava M. Katz ``Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer'' IEEE Transactions on Acoustics, Speech, and Signal Processing (1987)
C: correct sentence S: SRI-best candidate
Parser and Trigram both favor SRI-best
C: some dealers of foreign cars also lowered
their japanese prices (448,655)
S: some dealers of foreign cars also lowered
the japanese prices (424,614)
C: the problem isn't gridlock he says the
wheels are out of alignment (598,646)
S: the problem is in gridlock he says the
wheels are out of alignment (567,625)
Trigram favors Correct, but Parser favors SRI-best (Bad example)
C: board would review distributing the remaining shares in the gold subsidiary to the parent company's shareholders (1328,1306)
S: board would review distributing the remaining shares in the gold subsidiary to the parent company shareholders (1254,1333)
Trigram favors SRI-best, but Parser favors Correct (Good example)
C: mcdonnell douglas corporation has built
helicopter parts ... (1360,1548)
S: mcdonnell douglas corporation and bell
helicopter parts ... (1404,1491)
C: they are interested in commodities as
a new asset class van says (521,731)
S: they are interested in commodities says
a new asset class van says (560,720)
C: weary of worrying about withdrawal
charges if you want to leave ... (1132,1273)
S: weary of worrying about withdraw all
charges if you want to leave ... (1210,1202)
C: this scenario as they say on t.v. is
based on a true story (550,649)
S: this scenario as a say on t.v. is
based on a true story (576,644)
C: indirect foreign ownership is limited to 25%
(613,723)
S: in direct foreign ownership is limited to 25%
(695,709)
C: even some lawyers now refer clients to
mediators offering to review the mediated
agreement and provide advice if needed
(1045,1255)
S: even some lawyers now refer clients to
mediators offering to review the mediated
agreement can provide advice if needed
(1067,1253)
Others
C: the may figures show signs of improving sales
said noriyuki matsushima (951,?)
S: the may figures show signs of improving sales
said nora you keep matsui shima (1211,?)