In keeping with the annual Text REtrieval Conference (TREC), think of an information retrieval system as receiving as input a collection of words, called a query, and producing as output 1000 documents drawn from the document collection and ranked according to perceived relevance to the query. As was done in the Query Track of TREC-8, one can vary the queries and observe the effect on the documents retrieved. Such experiments are complicated by the fact that the space of queries is unstructured and the output is multivariate.
On the basis of human judgements as to which documents are actually relevant, we look at differences between outputs in two ways: one emphasizes the irrelevant documents returned, and the other emphasizes the order in which relevant documents are returned. For each system output, we measure rejection of irrelevant documents in a way analogous to survival analysis: we count the number of irrelevant documents that occur before 25 percent of the relevant documents have been returned, take the logarithm, and finally invert the sign so that higher is better. We judge the query-to-query difference in the order in which relevant documents are returned using the distance corresponding to Spearman's rank correlation coefficient. In forming this distance, the number of intervening irrelevant documents is ignored. We formed a composite distance from the distances given by three of the eight systems. An example of these two ways of comparing queries is shown in Figure 3.
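The two measures just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study. The function names are hypothetical. The 25 percent threshold is taken relative to all judged-relevant documents for the topic and rounded up. An offset of 1 is added before the logarithm so that a count of zero is defined. If the threshold is never reached, every irrelevant document in the list is counted. The "distance corresponding to Spearman's rank correlation" is taken here to be the Euclidean distance between rank vectors, computed over the relevant documents returned for both queries. The text does not specify any of these details.

```python
import math


def irrelevant_rejection(ranking, relevant, fraction=0.25, offset=1.0):
    """Negative log of the number of irrelevant documents ranked before
    `fraction` of the relevant documents have been returned.

    `offset` (an assumption, not from the source) avoids log(0).
    """
    target = math.ceil(fraction * len(relevant))
    if target == 0:
        raise ValueError("topic has no relevant documents")
    seen_relevant = 0
    irrelevant = 0
    for doc in ranking:
        if doc in relevant:
            seen_relevant += 1
            if seen_relevant >= target:
                break  # threshold reached; stop counting
        else:
            irrelevant += 1
    # Sign is inverted so that higher values mean better rejection.
    return -math.log(irrelevant + offset)


def spearman_distance(ranking_a, ranking_b, relevant):
    """Euclidean distance between the rank vectors of the relevant
    documents returned by both rankings.

    Irrelevant documents are dropped before ranking, so the number of
    intervening irrelevant documents has no effect on the distance.
    """
    rel_a = [d for d in ranking_a if d in relevant]
    rel_b = [d for d in ranking_b if d in relevant]
    common = set(rel_a) & set(rel_b)
    # Re-rank only the shared relevant documents within each list.
    pos_a = {d: r for r, d in enumerate(x for x in rel_a if x in common)}
    pos_b = {d: r for r, d in enumerate(x for x in rel_b if x in common)}
    return math.sqrt(sum((pos_a[d] - pos_b[d]) ** 2 for d in common))
```

With this form of the distance, two queries that return the relevant documents in the same order are at distance zero, however many irrelevant documents separate them in either list.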
The top graph shows a strip plot with the queries along the vertical axis and our irrelevant-rejection measure along the horizontal axis. Some of the queries contain misspellings. For each query, there is an open circle for each of the eight systems that participated in the Query Track. Clearly, inclusion of the word "Greenpeace" in the query is essential for superior rejection of irrelevant documents. A substitute such as query N does not work. The bottom graph shows the association among the queries, obtained through multidimensional scaling based on the order in which relevant documents are returned. The queries with the word "Greenpeace" are on the right. Queries B, C, G, J, and S contain the word "environment" in addition to "Greenpeace." Queries D, H, P, T, and U contain no additional specific words. Query R contains the word "anti-nuclear" in addition. Thus, the bottom graph shows the influence of the words "environment" and "anti-nuclear" on the retrieval order of the relevant documents.
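To make the bottom graph concrete, the following sketch places queries in the plane from a matrix of query-to-query distances. The text does not say which variant of multidimensional scaling was used. Classical (Torgerson) scaling is assumed here purely for illustration, and the function names, the power-iteration eigensolver, and the iteration count are all hypothetical choices made to keep the example dependency-free.

```python
import math


def _top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    vec = [1.0 + i for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n))
                for i in range(n))
    return value, vec


def classical_mds(dist, dims=2, iterations=500):
    """Coordinates in `dims` dimensions from a symmetric distance matrix,
    using classical (Torgerson) scaling (an assumed variant)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vector = _top_eigenpair(b, iterations)
        scale = math.sqrt(max(value, 0.0))  # negative eigenvalues are dropped
        for i in range(n):
            coords[i][k] = scale * vector[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= value * vector[i] * vector[j]
    return coords
```

Each row of the result is one query's position. Plotting the two columns against each other gives a map in which queries that order the relevant documents similarly lie close together. The orientation and sign of the axes are arbitrary, so only relative positions carry meaning.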
Using the same type of graphs, we studied the 50 topics in the Query Track with the aim of finding topic-independent generalizations useful in improving systems and in guiding users in query formulation. We found that query performance depends on the presence or absence of a few key terms. We found cases where additional key terms improved performance, where additional terms degraded performance, where substantial changes in query meaning changed performance only moderately, and where systems distinguished between queries that a human would not.
Figure 3: Query Performance and Multidimensional Scaling
Date created: 7/20/2001