THE TREC-6 SPOKEN DOCUMENT RETRIEVAL TRACK

Ellen Voorhees, John Garofolo
National Institute of Standards and Technology
Gaithersburg, MD 20899

Karen Sparck Jones
Cambridge University
Cambridge CB2 3QG, U.K.



ABSTRACT

The Text REtrieval Conference (TREC) workshops provide a forum for different groups to compare retrieval systems on common retrieval tasks. The 1997 TREC workshop will feature a Spoken Document Retrieval task for the first time. This paper motivates the task and describes the measures to be used to evaluate the effectiveness of the retrieval methodologies.

1. THE TEXT RETRIEVAL CONFERENCE

The Text REtrieval Conference (TREC) series is co-sponsored by the National Institute of Standards and Technology (NIST) and the Information Technology Office of the Defense Advanced Research Projects Agency (DARPA) as part of the TIPSTER Text Program. The series, which started in 1992, is designed to promote research in information retrieval by providing appropriate test collections, uniform scoring procedures, and a forum for organizations interested in comparing their results. Thirty-eight groups including representatives from nine different countries participated in TREC-5 in November, 1996.

TREC has two main tasks, ad hoc and routing retrieval. The ad hoc task investigates the performance of systems that search a static set of documents using novel queries; the routing task investigates the performance of systems that use standing queries to search new streams of documents. In addition, TREC has smaller "tracks" that allow participants to focus on particular subproblems of the retrieval task. Recent track tasks have included Spanish retrieval, Chinese retrieval, the use of natural language processing techniques for retrieval, and retrieval of documents that result from paper documents being scanned by an Optical Character Recognition (OCR) process.

The retrieval of OCR documents was the focus of the TREC-5 "Confusion" track. The Confusion Track investigated methods for retrieving document surrogates whose true content has been confused or corrupted in some way. A different form of corruption will be used in TREC-6: retrieving spoken documents (i.e., recordings of speech) through surrogates produced by speech recognition systems. This new track, the Spoken Document Retrieval (SDR) track, is intended to foster research on retrieval methodologies for spoken documents. A second goal of the track is to encourage collaboration between the speech and retrieval research communities.

This paper defines the particular task to be addressed in the SDR Track and motivates the track's design. A detailed specification of the track, including sign-up procedures, samples of the data formats, and particulars of result submission, can be found at www.itl.nist.gov/div894/894.01/sdr97.txt. More information about TREC itself can be found at www-nlpir.nist.gov/trec. Questions about the track can be sent to either (or both) of the track organizers at john.garofolo@nist.gov or ellen.voorhees@nist.gov.

2. THE SDR TRACK

The SDR Track was designed to encourage as much participation as possible in keeping with TREC's retrieval charter. The track therefore offers two modes of participation: SDR for those with speech recognizers and Q(uasi)SDR for those without. The latter is intended as a startup for those in the present retrieval community without immediate access to speech processing expertise (see footnote 1). While offering both options limits the experimental comparisons that can be made among groups and complicates the track definition, we anticipate that it will greatly expand the number of retrieval methodologies represented in the track.

2.1 Documents

The track will use stories (i.e., documents) taken from the Linguistic Data Consortium (LDC) 1996 Broadcast News corpus. This data was used in the November 1996 "Hub-4" DARPA Speech Recognition Evaluation [1,2]. The test set will consist of about 1000 stories representing 50 hours of recorded material. A story is generally defined as a continuous stretch of news material with the same content or theme (e.g., tornado in the Caribbean, fraud scandal at Megabank), which will have been established by hand segmentation of the news programs. Note, however, that some stories such as news summaries may contain topically varying material, and that a story is likely to involve more than one speaker, include background music or noise, etc.

There will be four forms of the story data supplied for the track as shown in Figure 1 and described below.

Figure 1: Data Flow in the SDR Track

All of the above file types will be cross-linked by SGML-tagged time markers for story beginnings and ends. In addition, the following auxiliary files will be provided:

2.2 The Task

A particular type of retrieval problem, called known-item searching, will be used in the SDR track. A known-item search is a retrieval task that simulates a user seeking a particular, partially-remembered document in the collection. In contrast to a more standard retrieval search where the goal is to retrieve/rank the entire set of documents that pertain to a particular subject of interest, the goal in the known-item search is to retrieve one particular document.
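As an illustration of what a known-item search ultimately measures, the following minimal Python sketch looks up the rank of the single target document in a system's ranked output. The function name and the story identifiers are hypothetical and are not part of the track specification.

```python
from typing import Optional, Sequence


def rank_of_known_item(ranking: Sequence[str], known_id: str) -> Optional[int]:
    """Return the 1-based rank of known_id in ranking, or None if it is absent."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id == known_id:
            return position
    return None


# Hypothetical story identifiers, for illustration only.
ranking = ["story_017", "story_342", "story_005"]
print(rank_of_known_item(ranking, "story_342"))
```

All of the effectiveness measures used by the track are functions of this single rank per topic.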

Figure 2: Example Known-item Topics from the TREC-5 Confusion Track

  • Use of solar power by the Florida energy office.
  • Excessive mark up of zero coupon treasury bonds.
  • I am looking for a document about the dismissal of a lawsuit involving Adventist Health Systems.
  • I am looking for theft data on the Chevrolet Corsica.
  • efforts to establish cooperative breeding programs for the yellow crowned amazon parrot.
  • morphological similarities between different populations of saltwater crocodiles.

Known-item searches were successfully used in the TREC-5 Confusion Track. Indeed, the searches are well-suited to this problem. When document content is corrupted, be it by OCR or speech processing, low-frequency words such as proper nouns and technical terms are the most affected. Yet low-frequency words are high-content-bearing words, and are precisely the words likely to be used to locate a specific document. Thus known-item searches exercise the parts of the retrieval methodologies that the track is most interested in. As a bonus, the searches do not require relevance assessments. Clearly, this obviates the need for relevance assessors' time -- a critical resource at NIST. But it also means the track can run with fewer participants: since TREC uses pooled results to approximate exhaustive relevance assessments, the quality of the relevance assessments depends on the diversity of the pool and hence on the number of participants.

The search topics (see footnote 2) will be produced at NIST and will be designed such that the author believes there is exactly one document in the collection that matches the topic. We expect there to be 50 topics created for the track. Example topics taken from the TREC-5 Confusion Track are given in Figure 2. Participants must use written forms of the topics for the required track runs. However, experiments with participants' own spoken versions are also welcome.

The set of retrieval runs for which results are to be submitted is given below. A retrieval run consists of running search queries for each of the 50 topics against a particular document set (see Figure 1). The set of runs required by the track was selected both to capture retrieval performance and to allow comparison between and within the SDR and QSDR Groups. We hope to gain insight not only into the overall performance levels obtainable, but also into how the speech recognition strategy and the retrieval strategy individually contribute to retrieval performance. The required retrieval runs are:

Participants may optionally submit a second Speech run and a second Baseline run to test the effects of variations in their own system parameter settings. Together, the required runs permit a variety of retrieval performance comparisons to be made. The Lexical TREC Transcription text runs demonstrate what the level of performance would be for the given documents and topics with a perfect speech recognizer and the teams' various retrieval strategies. On the other hand, the baseline and individual recognizer runs demonstrate the effects the various recognizers have on retrieval performance.

2.3 Evaluation

Although the track is using known-item searches, participants will be required to submit a full ranking of the collection (ordered in decreasing likelihood of the document being the known item) for each topic for evaluation. Experience has shown that a simple success/fail measure for the first document retrieved is too stringent for both plausible topics and the realities of speech or retrieval systems. In addition, participants in the SDR Group will submit recognizer output that will be evaluated using the traditional DARPA/NIST CSR word-error based metrics [3].
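For readers from the retrieval community who are unfamiliar with word-error based scoring, the sketch below computes a basic word error rate as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is a simplified illustration written for this purpose, not the NIST scoring software, and the official metrics may differ in text normalization and alignment details. The function name and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word out of three reference words.
print(word_error_rate("fraud scandal at", "fraud candle at"))
```

Note that a substitution error of this kind replaces a content word with another real word, which is exactly the sort of corruption a retrieval system must cope with when indexing recognizer output.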

Since the traditional retrieval effectiveness measures of recall and precision are uninformative for known-item searches, other measures must be used. We investigated three different measures in the TREC-5 Confusion Track [4], and these measures will be used in the SDR Track as well. Other measures may also be introduced in the SDR Track if further research produces more appropriate measures. In all cases, the measures are based on the ranks assigned to the known items. A sample evaluation from the TREC-5 Confusion Track is shown in Table 1 and Table 2.

Table 1: Raw Ranks of an Example TREC-5 Confusion Track Submission
Topic   Correct   5%   20%
1112
281544
3221
451124
5111
6111
7222
8111
9262
10118
11112
12112592
13333
142197
1514622000
16111
171711
181210
191193
20119
21229
22113
23112
24674
25111
26112
271118
28613384
30163960
31223
3271029
33111
341323
35111
36194981
371137
381923
39115342
4026138435
41115
42115
4311314
44111
45111
46111032000
47266119186
4823425
49116
5052156

Table 2: Evaluation Measures Computed for the Example TREC-5 Confusion Track Submission
Histogram: number of items found at rank r

Rank range           Correct    5%    20%
1 <= r <= 10            45      37     27
10 < r <= 100            3       7     13
100 < r <= 1000          1       5      7
Not found                0       0      2

The task in the TREC-5 Confusion Track was to rank the top 1000 documents per topic on each of three different versions of the 1994 Federal Register: the correct copy, a scanned copy that had approximately a 5% character error rate, and a scanned copy that had approximately a 20% character error rate. These sets of documents correspond to the LTT and two different SRT transcripts in the SDR Track. The rank at which the known item was retrieved for each of the three versions for all 49 topics for the Confusion Track example is given in Table 1. A document that was not retrieved at all in the top 1000 documents was assigned a rank of 2000. Sites are asked to rank the entire collection in the SDR Track since this will preclude the need for an artificial "not retrieved" rank and thus eliminate discontinuities in the effectiveness measures.

The raw ranks are used to compute the measures given in Table 2. The first measure, "Histogram", counts the number of topics for which the known item was found in a certain range of ranks. Since the SDR Track will not have a "Not found" category (the full collection is ranked), we will use the ranges of 1-5, 1-10, 1-20, 1-100, and over 100. The overlapping categories in the histogram permit the histogram counts to be compared across systems (with disjoint ranges, system A may have fewer known items found in ranks 6-10 than system B simply because it has more found in ranks 1-5). The count for a range 1-k, divided by k times the number of topics, is then the mean precision after k documents retrieved (k = 5, 10, etc.), which is a common measure used in the rest of TREC.
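The histogram measure and its relation to precision can be sketched in a few lines of Python. The sketch counts, for each cutoff k, the topics whose known item was ranked at or above k, and converts such a count into mean precision after k documents. The cutoffs 5, 10, 20, and 100, the function names, and the example ranks are assumptions made for illustration; the ranks are invented and are not taken from Table 1.

```python
from typing import Dict, Sequence


def cumulative_histogram(ranks: Sequence[int],
                         cutoffs: Sequence[int] = (5, 10, 20, 100)) -> Dict[str, int]:
    """Count topics whose known item is ranked within each cutoff, plus the remainder."""
    counts = {f"1-{k}": sum(1 for r in ranks if r <= k) for k in cutoffs}
    largest = max(cutoffs)
    counts[f"over {largest}"] = sum(1 for r in ranks if r > largest)
    return counts


def mean_precision_at(ranks: Sequence[int], k: int) -> float:
    """Mean precision after k documents when each topic has exactly one relevant item."""
    found = sum(1 for r in ranks if r <= k)
    # Each topic contributes 1/k if its known item is in the top k, else 0.
    return found / (k * len(ranks))


# Invented ranks of the known item for five topics.
ranks = [1, 3, 7, 15, 250]
print(cumulative_histogram(ranks))
print(mean_precision_at(ranks, 5))
```

Because each cutoff is counted from rank 1, a system is never penalized at a deeper cutoff for having placed more known items near the top.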

The second measure, labeled "Mean rank when found", is the mean rank at which the known item was found averaged across all topics that retrieved the known item in the top 1000 documents. This measure gives an easily-interpreted idea of how well the retrieval methodology ranks the known item if it finds it at all. Since the SDR track will rank all documents, the average will always be computed over all 50 topics. (When the average is computed over all topics, this measure is also known as expected run length.)
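A minimal Python sketch of this measure follows. It assumes, as in the Confusion Track example, that a known item that was not retrieved is recorded with the sentinel rank 2000; the function name and the example ranks are invented for illustration.

```python
from typing import Optional, Sequence

NOT_FOUND = 2000  # sentinel rank used in the Confusion Track example for unretrieved items


def mean_rank_when_found(ranks: Sequence[int],
                         not_found: int = NOT_FOUND) -> Optional[float]:
    """Average rank of the known item over the topics that retrieved it at all."""
    found = [r for r in ranks if r != not_found]
    if not found:
        return None  # no topic retrieved its known item
    return sum(found) / len(found)


# Invented ranks: the third topic did not retrieve its known item.
print(mean_rank_when_found([1, 4, NOT_FOUND, 10]))
```

When every document is ranked, no sentinel values occur and the function reduces to the plain mean over all topics.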

The last measure is called the "Mean reciprocal rank". It is the mean of the reciprocal of the rank at which the known item was found over all the topics, using 0 (not 1/2000) as the reciprocal for topics that did not retrieve the known document. Unlike the mean rank when found measure, this measure penalizes runs that did not retrieve a known item while minimizing the difference between, say, retrieving a known item at rank 750 and retrieving it at rank 900. It is also bounded between 0 and 1, inclusive, so the measure is interpretable without knowing how many documents were ranked. Indeed, since there is only one relevant document per query, the reciprocal rank of that document is the precision at that document, and therefore it is the average precision of the query as well (average precision is the precision averaged over all relevant documents of the query). Average precision is another frequently used measure in the other parts of TREC, so "mean reciprocal rank" gives some basis of comparison with other retrieval methods.
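The computation can be sketched in Python as follows, again assuming that an unretrieved known item is recorded with the sentinel rank 2000 and contributes a reciprocal of 0. The function name and the example ranks are invented for illustration.

```python
from typing import Sequence

NOT_FOUND = 2000  # sentinel rank for a known item that was not retrieved


def mean_reciprocal_rank(ranks: Sequence[int], not_found: int = NOT_FOUND) -> float:
    """Mean of 1/rank over all topics, counting unretrieved known items as 0."""
    if not ranks:
        raise ValueError("at least one topic is required")
    total = sum(0.0 if r == not_found else 1.0 / r for r in ranks)
    return total / len(ranks)


# Invented ranks for four topics; the third did not retrieve its known item.
print(mean_reciprocal_rank([1, 2, NOT_FOUND, 4]))

# Ranks 750 and 900 differ by 150 positions but by under 0.0003 in reciprocal rank.
print(abs(1 / 750 - 1 / 900))
```

The second print statement illustrates why the measure is insensitive to differences among very low-ranked items while remaining sensitive to differences near the top of the ranking.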

3. CONCLUSION

The Spoken Document Retrieval Track is intended to foster research on indexing and retrieving spoken documents. While the SDR problem has parallels to the problem of retrieving documents that have been corrupted due to OCR errors, solutions to the two problems are likely to be quite different since the nature of the corruption differs in the two cases. Whereas OCR errors tend to turn words into non-words, speech recognition errors tend to substitute other actual words for correct words.

The TREC-6 SDR Track is the initial offering of a spoken document retrieval track and as such must be viewed as something of an experiment itself. The results of the track are sure to be preliminary if only because a 1000-document collection -- a formidable challenge to produce from 50 hours of speech -- is very small for retrieval experiments. But we strongly encourage active participation in this track in order to gain sufficient experience with the SDR problem to guide future research.

REFERENCES

1. Garofolo, J.S., Fiscus, J.G., and Fisher, W.M., "Design and preparation of the 1996 HUB-4 broadcast news benchmark test corpora", Proceedings of the DARPA 1997 Speech Recognition Workshop, 1997.

2. Graff, D., Wu, Z., MacIntyre, R., and Liberman, M., "The 1996 broadcast news speech and language-model corpus", Proceedings of the DARPA 1997 Speech Recognition Workshop, 1997.

3. Pallett, D.S., et al., "1996 Preliminary broadcast news benchmark tests", Proceedings of the DARPA 1997 Speech Recognition Workshop, 1997.

4. Kantor, P., and Voorhees, E.M., "The TREC-5 confusion track", The Fifth Text REtrieval Conference (TREC-5), to appear.

FOOTNOTES

1 As this is a TREC track, all participants are required to produce retrieval output. Those in the speech community who do not have their own retrieval system may use a commercial retrieval system or a publicly available system such as NIST's ZPRISE system.

2 Statements of information need are called "topics" in TREC to distinguish them from "queries" that actually get submitted to retrieval systems.