School of Computer Science and Information Technology
RMIT University
GPO Box 2476V, Melbourne 3001, Australia
Vaughan R. Shanks Hugh E. Williams
For this year's evaluation we participated only in the topic tracking task. Given a few example stories on a particular topic, the task of a tracking system task is to process a supplied list of data files, and classify all stories in these files as either on-topic or off-topic. This task is repeated for over 100 topics.
There are few examples provided for tracking a topic. Under the TDT evaluation rules, only one, two, or four on-topic examples are provided, and only zero or two off-topic examples. To simulate field conditions for a tracking system, no data may be used that is as recent, or more recent than the data supplied. Therefore, any term statistics or pseudo-relevance feedback used must be gathered from sources published before 1998.
For this year's evaluation, we aimed to implement a baseline system that would produce reasonable, TDT-compliant results. We restricted our problem domain to English language documents and automatically-translated Mandarin; we did not consider any ASR texts, and only processed newswire and manually-transcribed broadcast news stories.
Our secondary goals were efficiency--a similar goal to our recent categorisation research [7]--and to experiment with various term extraction and profiling techniques to see which work better and why.
The system we used was based on a Rocchio [6] routing categoriser we previously developed for fast document categorisation [7]. The system was developed in C and experiments were carried out on a Pentium III based server with 256 megabytes of main-memory. In this environment, our system can process around 10 gigabytes of SGML-formatted text per hour.
Terms were extracted as single words, delimited by whitespace. All terms extracted were case-folded. Words were then stemmed using Porter's algorithm [4]. We found that use of a Porter stemmer is a significant overhead: the throughput rate of our system was reduced to half when the stemmer was employed. We plan in the future to use a simpler or dictionary stemmer for efficiency. A list of 477 common English words was used as a stoplist.
As a background model of news document text, we used newswire documents from
Associated Press, Wall Street Journal, and the Federal Register from TREC
disks 1 and 2 [1].
Term document frequency statistics were gathered and stored to an auxiliary
vocabulary file.
Along with the document frequency (
), we store the
Okapi [5] inverse document frequency of these terms:
A document term frequency (TF) factor is calculated using an Okapi term
weight:
Topics profiles are maintained for each topic.
The topic profile is calculated initially by generating weight vectors for
each of the training examples supplied, then adding the positive, and
subtracting the negative examples, to form one vector.
Following the well-known heuristic [2,3], negative
weights in the profile are zeroed; the Rocchio parameters
and
are therefore both set to 1.0.
The resultant vector is then normalised to unit size.
The similarity
between the topic profile vector,
, and a document vector,
, are calculated using the well-known cosine measure [8]:
The threshold for classification decisions is set by calculating the cosine
measure of each of the training examples against the topic profile.
The maximum off-topic and minimum on-topic values are recorded and a
threshold calculated according to the following formula:
Simple, adaptive profiling was built into the system as an optional feature. With adaptive profiling enabled, the system performs unsupervised learning. Documents that are selected as on-topic--those with a YES decision--are added to the topic profile. Documents that are off-topic--those with a NO decision--are subtracted from the topic profile. A fixed number of the highest weighted terms are retained in an active profile which is used for performing similarity calculations. In our initial implementation described here, translated Mandarin documents were not used in adaptive feedback as it was found that the limited vocabulary in these documents results in skewing of the profile towards recognising all Mandarin documents as on-topic examples.
Document vectors that are used to adapt the profile are scaled-down to
reduce the possibility of a false decision distorting the profile.
A certainty factor is calculated for each decision made as follows:
Optional pseudo-relevance feedback (PRF) is also included in the system, as a method to include more related terms and raise recall. Three gigabytes of TREC data were chosen as a source of PRF. PRF is introduced into the system after training is complete and the topic profile has been normalised, but before a threshold has been chosen.
The MG [8] full-text information retrieval engine (mg-2.1) is used to generate PRF. A ranked query is constructed from the highest-weighted three terms in the topic profile, and the top ten documents returned by a ranked query are processed into document vectors. These document vectors are then added to the topic profile in the same manner as adaptive feedback documents, except that certainty is set to the same value as similarity.
Prior to the release of the TDT2001 evaluation corpus and indexes, the TDT3 data and TDT2001 dry run indexes were used to compare relative system performance under various conditions. All runs performed used English training documents, reference document boundaries, English and translated Mandarin test documents, and manual transcripts of audio sources. The following control files from the dry run indexes were used:
trk_SR=nwt+bnman_TR=eng_TE=mul,eng_Nt=1.ctl trk_SR=nwt+bnman_TR=eng_TE=mul,eng_Nt=2.ctl trk_SR=nwt+bnman_TR=eng_TE=mul,eng_Nt=4.ctlSeveral experiments were performed using
Runs were submitted for evaluation using the parameters shown in
Table 1.
A comparison of results of the runs shown in Table 1 is shown in Table 2 (these are taken from the official NIST TDT evaluation). Statistics displayed are topic weighted and macroaveraged.
|
Table 2 shows that our false alarm rates are very low, while
the miss rates are high, resulting in poor
scores.
We believe that this is due to two factors:
In Table 3, it can be seen that by lowering our threshold, we could have achieved less misses and more false alarms, with an overall improvement in our performance.
|
Our primary evaluation condition run is shown in Figure 1.
Our best evaluation run was p42 and this is shown plotted in Figure 2.
Our runs had high precision, as evidenced by the low false alarm rate. However, recall is the area of deficiency in our runs and this results in only moderate performance overall. One immediate improvement on recall, which can be observed by comparing runs st42 and th842, is to lower the decision threshold.
Our best run used pseudo-relevance feedback.
However, a closer examination of the minimum achievable
shows that this is only because of the threshold point chosen.
It appears that the th842 and st42 runs still have the best
minimum
.
Normalising topic score distributions may prove an effective way to solve
this problem, while a better source of pseudo-relevance feedback may be an
older TDT corpus rather than TREC.
Adaptive profiling appears to have been partly successful, but it is apparent that there is still significant improvement possible. In particular, we have found that it is difficult to balance positive and negative feedback, and to keep the system stable. One obvious improvement would be addition of the ability to recalculate the threshold as the profile changes.
Static profiling has the advantage of being simpler to implement, and faster to run, as we skip all the steps of sorting and renormalising the active profile. However, static profiling is limited in that it cannot handle ``concept drift'' in a topic. We believe that it may be necessary to use adaptive profiling in order to improve our results further.
The adaptive profiling system used here implements a linear scaling function for profile adaptation, based on the certainty of a document being on or off topic. A better method may be to use a sigmoid function, so that borderline documents are totally ignored.
For this year's TDT evaluation, we developed a baseline topic tracking system. Some variations on the baseline were attempted, including adaptive tracking and pseudo-relevance feedback. While our system was efficient and based on our previous work in fast categorisation, the accuracy of the system can be improved substantially.
There are many possible future directions. The most promising directions for increasing accuracy are to improve the thresholding algorithm, normalise topic relevance scores, ambiguate topic profiles, apply higher weightings to proper names, and to recognise some multiword features. We also plan additional efficiency work.
This work was supported by the Australian Research Council, the Australian Defence Science and Technology Organisation, the Australian Defence Signals Directorate, and the Multimedia Database Systems group at RMIT University.
This document was generated using the LaTeX2HTML translator Version 99.2beta6 (1.1.1.1)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 rmit.tex
The translation was initiated by Hugh Williams on 2001-11-07