This file summarizes the 2004 pilot for evaluation of relationship questions, performed by NIST and AQUAINT program contractors. The first part of the summary gives an overview of the pilot, and the second part describes the format of the data files resulting from the pilot.

The data files include:

1. a set of 50 relationship topics and an answer to each one as compiled by their authors ("answers")
2. a list of concepts that should be included in the evidence for the answers to the questions for each topic, as determined by an assessor ("nuggets")
3. the judging of the systems' responses as determined by an assessor (".judged")
4. one or more document ids for each nugget of evidence ("autonuggetdocs" and "manualnuggetdocs")
5. pdf of slides presented at the AQUAINT breakout session on relationship questions, held on October 6, 2004 ("relpilot-1004.pdf").


1. AQUAINT Pilot for Relationship Questions
===========================================

The purpose of the relationship questions pilot was to examine the issues involved in evaluating how well computer systems can locate "evidence" for certain kinds of relationships from a collection of documents.

A group meeting with analysts in early 2004 resulted in the understanding that a "relationship" is the ability of one object to influence another. Evidence for a relationship includes both the means to influence something and the motivation for doing so. Relationships can involve both entities and events. Eight types of relationships ("spheres of influence") were noted:

  * financial
  * movement of goods
  * family ties
  * communication pathways
  * organizational ties
  * co-location
  * common interests
  * temporal connection

The particular relationships of interest depend on the analyst, situation, and purpose. A major concern is recognizing when evidence for a suspected tie is lacking and determining whether the lack is because the tie doesn't exist or because it is being hidden or overlooked.
The analyst needs sufficient information to establish confidence in any evidence given.


Pilot Task Description
----------------------

The AQUAINT relationship pilot used TREC-like "topic" statements to set up a context for each question. The topics were developed by 13 analysts from the military who searched the AQUAINT collection looking for appropriate topics. The AQUAINT collection covers the time period of 1998-2000 and consists of news stories taken from the New York Times, the AP newswire, and the English portion of the Xinhua newswire (see LDC catalog number LDC2002T31). The military analysts created mini-scenarios that had relevant information contained in the collection. From these scenarios, an analyst from the NSA selected and developed 50 topics that were used for the evaluation pilot.

Each topic statement set the context for a question that asked for evidence for one of the types of relationships listed above. The question was either a yes/no question that was to be understood as a request for evidence supporting the answer ("[w]ill the Japanese use force to defend the Senkakus?"), or a request for the evidence itself ("What types of disputes or conflict between the PLA and Hong Kong residents have been reported?"). Sometimes multiple subquestions were embedded in a single topic ("Who were the participants in this spy ring, and how are they related to each other?").

The participating systems were given 50 relationship topics and the AQUAINT document collection. Systems were to return one list of text snippets per topic such that each item in the list was a piece of evidence supporting the answer to the question in the topic. There were no limits placed on either the length of an individual snippet or on the number of snippets in a list, though systems knew they would be penalized for retrieving extraneous information.
The format of a response was the same as for the AQUAINT Definition Pilot, namely a file containing lines of the form

    topic-number run-tag doc-id evidence-string

where run-tag is a string that is used as a unique identifier for the run, and evidence-string is the piece of evidence derived (extracted, concluded, etc.) from the given document.


Evaluation of System Responses
------------------------------

Evaluation of the nuggets returned by the systems was done as in the AQUAINT Definition Pilot. (See the bottom of http://trec.nist.gov/data/qa/add_qaresources.html for details.) For each topic, an assessor first used the topic author's answer and the responses from all the systems to create a list of "information nuggets" representing evidence for the answer. An information nugget was defined as a fact for which the assessor could make a binary decision as to whether a response contained the nugget. The assessor then decided which nuggets were vital pieces of evidence and which were merely "okay". Finally, the assessor went through each of the system responses and marked where each nugget appeared in the response. If a system returned a particular nugget more than once, it was marked only once.

Precision and recall for a response were computed over the nuggets. Recall was computed as the ratio of the number of correct vital nuggets retrieved to the number of vital nuggets in the assessor's list. Precision was approximated using the length of the response. The length-based measure gave an allowance of 100 (non-white-space) characters for each vital or okay nugget retrieved. The precision score was set to one if the response was no longer than this allowance. If the response was longer than the allowance, the precision score was downgraded using the function

    1 - [(length - allowance) / length]

The final score for a response was computed using the F-measure, a function of both recall (R) and precision (P).
The general version of the F-measure is

    F = ((beta^2 + 1) * R * P) / (beta^2 * P + R)

where beta is a parameter signifying the relative importance of recall and precision. The evaluation in the pilot used a value of beta=3, indicating that recall is 3 times as important as precision.


System Results
--------------

Four groups participated in the pilot, submitting a total of six runs (labeled A-F). The following table shows the average response length, F-score (beta=3), recall, and precision for each run:

    Run     Avg. length   F(b=3)   Recall    Precision
    -----   -----------   ------   -------   ---------
    Run-A        527.48   0.429    0.4595    0.42616
    Run-B        929.78   0.393    0.45918   0.3037
    Run-C       3984.74   0.391    0.66036   0.10722
    Run-D        850.12   0.302    0.35482   0.23086
    Run-E        689.16   0.298    0.34448   0.26058
    Run-F        855      0.292    0.34418   0.22878

Because the scoring metric favored recall over precision, systems were able to improve their overall rankings by returning long responses. In particular, Run-C returned responses that were on average more than four times as long as those of any other run, and managed to rank third overall despite having the lowest precision of all the runs. However, not all the runs relied on long responses to achieve a high score. The highest-scoring run (Run-A) also returned the shortest responses.

Among the six runs, the systems were able to find over 85% of the 151 vital nuggets in the assessors' list. However, automatic detection of relationships could be further improved by improving systems' ability to do co-reference of entities and events across documents, which in turn could be aided by reasoning about temporal relations. For example, Topic 23 asked for the list of members of a coalition called the "Coalition to Stop the Use of Child Soldiers." All the relevant documents retrieved by the systems were dated on or after Tuesday, June 30, 1998, the date when the coalition was launched.
However, systems failed to retrieve a document published the day before, which referred to the impending Tuesday launch of the [unnamed] coalition and listed the groups involved (a number of which were not mentioned in the later documents). Event detection and tracking also could have helped for Topic 49, which asked if the MILF had any dealings with the government of Indonesia in the 1990s. Systems returned evidence for a planned meeting between the MILF leader and the Indonesian president, but failed to retrieve a later document containing vital evidence that the planned meeting was canceled.


Conclusion
----------

Participants in this relationship pilot generally expressed satisfaction with the exercise, especially with the additional context provided by the topics. However, there were several requests for document ids for all the nuggets. While NIST reverse-engineered this list after the evaluation, it would ideally be created by the assessors at the time they created the nuggets list.

Requiring assessors to include a document id for each nugget would improve the quality of the nuggets. Sometimes the glosses of the nuggets were so vague (e.g., "NBC production") that it would be difficult for anyone other than the author to understand the intended meaning of the "evidence" without reference to a source document. At other times nuggets were non-atomic, making it difficult to calculate concept recall. For example, Topic 29 asked for a list of suspected participants in the 1998 bombing of US embassies; the assessor used a vital nugget glossed as "Suspected participants in 1998 bombings of US Embassies" and gave systems full credit for this nugget even if the returned evidence identified only one of the 15 individuals on the list of suspected participants.

The nugget approach to evaluation could also be more useful if the assessors marked all returned nuggets, including non-distinct ones.
Having this additional information could help participants develop and train their systems.


2. Data Files
=============

"answers"
---------

The "answers" file is a compilation of notes written by the analysts who created the topics. For each topic there is the topic id, the topic itself, the answer to the question, the document ids of documents containing evidence for the answer, and a gloss of the evidence.


"nuggets"
---------

The nuggets that were used for assessing the runs are in the "nuggets" file. The format of each line is:

    topic-number nugnum vital|okay evidence-string

where topic-number is the topic number, nugnum is the nugget number, vital|okay indicates whether the nugget of evidence is "vital" or "okay", and evidence-string is the gloss of the evidence.


".judged"
---------

The assessments of the systems' runs are given in the .judged files, with one file for each run. A judged assessment file contains two parts per topic. The first part repeats what was submitted to NIST with the exception that an "item number" is added. An item number is simply the count of the strings submitted to NIST. Thus the format of this part is

    topic-number run-tag item-number doc-id evidence-string

where topic-number is the topic number; run-tag is the id of the run (A-F); doc-id is the id of the document from which the evidence was drawn; and evidence-string is the text snippet.

The second part lists the nuggets that were assigned to each item:

    topic-number run-tag item-number nugget-number

where nugget-number is the number of the nugget as listed in the "nuggets" file. For example, the line

    14 Run-A 1 3

means that the third nugget of topic 14 was found in the first item in the response submitted to NIST by Run-A. If a nugget appeared multiple times in a response, it was marked only once, with the assessor picking the item he or she thought was the "best" match (for an unknown and arbitrary definition of best).
A single item might match multiple nuggets, in which case the item is repeated in the second part. For example:

    16 Run-A 1 2
    16 Run-A 1 3

means that the first item in the response for topic 16 contained nugget 2 and nugget 3. If no nuggets were found in the response, then the second part will be empty.


"autonuggetdocs" and "manualnuggetdocs"
---------------------------------------

These two files contain document ids for the nuggets in the "nuggets" file and were created after the assessment files were returned to the pilot participants. The document ids either come from the systems' responses or were generated manually based on inspection of the "nuggets" and "answers" files. Each line in the "autonuggetdocs" and "manualnuggetdocs" files is in the format:

    topic_id nugget_id vital|okay doc_id string

where each (topic_id, nugget_id) pair is labeled as either "vital" or "okay", based on the "nuggets" file.

"autonuggetdocs": contains system responses for each nugget for which some system returned a matching response. doc_id and string are the document id and evidence string returned by the system that match this nugget_id for this topic_id. Multiple (doc_id, string) pairs may be included, one per line, for each topic_id and nugget_id pair, but duplicates are removed.

"manualnuggetdocs": If a nugget was not matched by any system's response, a document id was found manually for the nugget using queries to the document collection, based on the "nuggets" and "answers" files; string is the gloss of the evidence, as given in the "nuggets" file. Only one doc_id is provided for each (topic_id, nugget_id) pair. If no document could be found for the nugget or if the intended meaning of the nugget string was not inferable, then "UNDOCUMENTED" appears in the doc_id field. (N.B. Nugget 8 of topic 4 seems to actually belong to topic 5.)
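

Example: rescoring a run from the data files
--------------------------------------------

The scoring described under "Evaluation of System Responses" can be reproduced approximately from the "nuggets" file and a judged file. The Python sketch below is an illustration only and is not the NIST scoring code. The function names (parse_nuggets, parse_judged, score_topic) are hypothetical. The sketch assumes that fields are separated by white space, that any four-field line in a judged file belongs to the second part (nugget assignments), and that a topic with no vital nuggets receives a recall of 0. The sample lines at the bottom are made up for demonstration and do not come from the pilot data.

```python
from collections import defaultdict

ALLOWANCE_PER_NUGGET = 100  # non-white-space characters per matched nugget


def parse_nuggets(lines):
    """Map topic -> {nugget number: 'vital' or 'okay'}."""
    nuggets = defaultdict(dict)
    for line in lines:
        parts = line.strip().split(None, 3)
        if len(parts) >= 3:
            nuggets[parts[0]][parts[1]] = parts[2]
    return nuggets


def parse_judged(lines):
    """Return (topic -> snippets, topic -> matched nugget numbers)."""
    snippets = defaultdict(list)
    matches = defaultdict(set)
    for line in lines:
        parts = line.strip().split(None, 4)
        if len(parts) == 5:    # first part: ... item-number doc-id evidence
            snippets[parts[0]].append(parts[4])
        elif len(parts) == 4:  # second part (assumed): ... item nugget
            matches[parts[0]].add(parts[3])
    return snippets, matches


def score_topic(snippets, matched, topic_nuggets, beta=3.0):
    """Return (recall, precision, F) for one topic of one run."""
    vital = {n for n, kind in topic_nuggets.items() if kind == "vital"}
    okay = {n for n, kind in topic_nuggets.items() if kind == "okay"}
    vital_hits = len(vital & set(matched))
    okay_hits = len(okay & set(matched))

    # Assumption: recall is 0 when a topic has no vital nuggets.
    recall = vital_hits / len(vital) if vital else 0.0

    # Response length counts non-white-space characters only.
    length = sum(len("".join(s.split())) for s in snippets)
    allowance = ALLOWANCE_PER_NUGGET * (vital_hits + okay_hits)
    if length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length

    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * recall * precision / denom if denom else 0.0
    return recall, precision, f


# Made-up sample data in the documented formats (not from the pilot).
nugget_lines = [
    "16 1 vital first hypothetical nugget",
    "16 2 vital second hypothetical nugget",
    "16 3 okay third hypothetical nugget",
]
judged_lines = [
    "16 Run-A 1 DOC-ID-1 " + "x" * 250,  # one 250-character snippet
    "16 Run-A 1 2",
    "16 Run-A 1 3",
]

nuggets = parse_nuggets(nugget_lines)
snippets, matches = parse_judged(judged_lines)
recall, precision, f = score_topic(snippets["16"], matches["16"], nuggets["16"])
print(f"recall={recall:.3f} precision={precision:.3f} F={f:.3f}")
```

In this made-up case one of two vital nuggets and one okay nugget are matched, so recall is 0.5 and the allowance is 200 characters; the 250-character response then gets precision 1 - (250 - 200)/250 = 0.8. Averaging the per-topic scores over all topics of a run would give run-level figures comparable in form to the results table, subject to the assumptions stated above.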