NIST 2007 Automatic Content Extraction Evaluation Official Results
(ACE07)

Date of Release: Wednesday May 2nd, 2007

Version 2


Release History

  • Version 2: First public release of official ACE-07 evaluation results
  • Version 1b: Updated release of official results to the evaluation participants
  • Version 1: Initial release of official results to the evaluation participants

    Introduction

    The NIST 2007 Automatic Content Extraction Evaluation (ACE07) was part of an ongoing series of evaluations dedicated to the development of technologies that automatically infer meaning from language data. NIST conducts these evaluations in order to support information extraction research and help advance the state-of-the-art in information extraction technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities.

    Disclaimer

    These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers were generally from research systems, not commercially available products. Since ACE07 was an evaluation of research algorithms, the ACE07 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for scoring. The systems themselves were not evaluated.

    The data, protocols, and metrics employed in this evaluation were chosen to support Information Extraction research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes to the task protocols could reveal different performance strengths and weaknesses for these same systems. For these reasons, this evaluation should not be considered a product testing exercise.


    Evaluation Tasks

    The ACE07 evaluation consisted of five main tasks, three mention-level tasks, and three diagnostic tasks (not reviewed here). These tasks required systems to process language data in documents and then to output, for each document, information about the entities, values, temporal expressions, relations, and events mentioned or discussed in it. These tasks were evaluated separately for each of the ACE languages (Arabic, Chinese, English, and Spanish).

    View the official ACE evaluation specification document for a complete listing of the evaluation protocols and the list of language and task combinations evaluated (page 5, table 10).

    Evaluation Data

    ACE-07 reused the entire evaluation test set from ACE-05. In addition, new data was included from the REFLEX corpus (a three-way translated/annotated data set between English, Arabic, and Chinese sources). This added data originated from the ACE-05 data: 10,000 words of Arabic were translated into both Chinese and English, annotated for ACE entities and temporal expressions, and used for the evaluation.

    Source Data

    There were four evaluation source sets, one for each language under test. The English sources contained approximately 70,000 words from the following domains: Broadcast Conversations, Broadcast News, Newswire, Telephone, Usenet, and Weblogs.

    Both the Arabic and Chinese source sets contained approximately 70,000 words (1.5 Chinese characters = 1 word) from the following domains: Broadcast News, Newswire, and Weblogs.

    All of the Spanish source data was taken from the Newswire domain, and the test set size was approximately 50,000 words.

    Reference Data

    All of the ACE test data were fully annotated by the Linguistic Data Consortium.


    Performance Measurement

    Each ACE task is a composite task involving detection and one or more of recognition, clustering, and normalization. Multiple attributes are considered important and are individually measured, with overall performance measured using a value formula that applies weights to each attribute.

    The value score is defined to be the sum of all values of all of the system's output tokens, normalized by the sum of the values of the reference data. The possible value of a system output token depends on how closely it matches that of the reference token to which it is mapped. A value score can range from a negative score up to 100%.
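    The normalization described above can be sketched in a few lines. This is a hypothetical simplification, not the official scoring code: the real formula (see the specification appendices) applies many attribute-specific weights to each token, which are collapsed here into a single mapped value per system token and a penalty per false alarm.

```python
# Hypothetical sketch of the ACE value-score normalization described above.
# The real formula applies attribute-specific weights; here a single
# mapped value per token and a false-alarm penalty stand in for that.

def value_score(reference_values, mapped_system_values, false_alarm_penalties):
    """Sum of system output token values, normalized by the total
    reference value. False alarms subtract value, so the score can
    go negative; a perfect system scores 1.0 (i.e., 100%)."""
    earned = sum(mapped_system_values) - sum(false_alarm_penalties)
    return earned / sum(reference_values)

# A perfect system recovers all reference value with no false alarms:
print(value_score([1.0, 1.0], [1.0, 1.0], []))   # 1.0
# Heavy false alarms can drive the score below zero:
print(value_score([1.0, 1.0], [0.5], [3.0]))     # -1.25
```

    This makes the stated score range concrete: the score is capped at 100% (all reference value recovered, nothing spurious) but unbounded below.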

    View the appendices of the official ACE evaluation specification document for a complete description of the ACE scoring formulas.


    Evaluation Participants

    The table below lists the sites that registered and processed the evaluation test data for one or more of the ACE tasks in the 2007 ACE evaluation. The letters 'A', 'C', 'E', and 'S' identify which language(s) (Arabic, Chinese, English, or Spanish) were processed at each site for each task.

    Site | entities | relations | events | time | values | emd | rmd | vmd
    BBN Technologies | AE | E | E | . | . | AE | E | E
    # Chinese Academy of Sciences - Institute of Automation | . | . | . | C | . | C | . | .
    # Chinese Academy of Sciences - Institute of Software | C | C | . | . | . | C | . | .
    Fudan University | CE | . | . | . | . | CE | . | .
    IBM | AES | . | . | E | . | AES | . | .
    Language Computer Corporation | CE | C | . | . | . | ACE | CE | .
    Lockheed Martin | CE | . | . | CE | . | CE | . | .
    * Macquarie University | . | . | . | E | . | . | . | .
    # Northeastern University of China | C | . | . | . | . | C | . | .
    # Polytechnic University of Hong Kong | C | . | . | C | . | C | . | .
    SAIC | E | . | . | . | . | E | . | .
    SUNY - University of Albany | . | C | . | . | . | . | . | .
    Technical University of Catalonia | . | . | . | . | . | E | E | .
    University of Amsterdam | . | . | . | E | . | . | . | .
    Universidad Carlos III de Madrid | . | . | . | S | . | . | . | .
    # XIAMEN University | C | . | . | C | . | . | . | .

    # Sites with incomplete participation - failed to attend the evaluation workshop. Value scores from these systems are not included here.
    * Site with excused absence from evaluation workshop - medical.


    Evaluation Results for the MAIN ACE tasks

    The tables below list the official results of the NIST 2007 Automatic Content Extraction evaluation. Scores for each site's primary system are shown and are ordered by their "overall" value score. In some cases, after preliminary ACE scores were released to the participants, bug-fixed systems were submitted. Scores for these revised systems are shown at the bottom of each chart.

    Results for ACE Entities (EDR task):
  • Table 1a - Arabic
  • Table 1b - Chinese
  • Table 1c - English
  • Table 1d - Spanish

    Results for ACE Relations (RDR task):
  • Table 2a - English
  • Table 2b - Chinese

    Results for ACE Events (VDR task):
  • Table 3a - English

    Results for ACE Temporal Expressions (TERN task):
  • Table 4a - Chinese
  • Table 4b - English
  • Table 4c - Spanish

    There were no participants in the Value task for any language.

    Results for ACE Entity Mention (EMD task):
  • Table 5a - Arabic
  • Table 5b - Chinese
  • Table 5c - English
  • Table 5d - Spanish

    Results for ACE Relation Mention (RMD task):
  • Table 6a - Chinese
  • Table 6b - English

    Results for ACE Event Mention (VMD task):
  • Table 7a - English

    Evaluation of Entities

    The ACE "Entity Detection and Recognition" task requires systems to identify the occurrences of a specified set of entities {Persons, Organizations, Locations, Facilities, Geo-Politicals, Weapons, Vehicles} in the source language documents. Complete descriptions of ACE entities and entity attributes that are to be detected can be found in the ACE evaluation specification document.

    Table (1a) lists the overall value score for the Arabic evaluation test set, and breaks out the value score for each of the three domains.

    Table 1a - Arabic - Entities
    Site | Overall | Broadcast News | Newswire | Weblogs
    BBN Technologies | 48.8 | 51.9 | 49.4 | 42.1
    IBM | 45.4 | 49.4 | 46.6 | 34.6
    Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds that the difference in performance between these two systems is statistically significant at the 95% confidence level. This test was run over ALL documents.
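    The document-level significance test cited in these notes is a standard Wilcoxon signed-rank test over paired per-document scores. The sketch below is a minimal pure-Python illustration with invented scores; the official analysis paired the per-document ACE value scores of the two systems under comparison.

```python
# Pure-Python sketch of a Wilcoxon signed-rank test over paired
# per-document scores. The scores below are invented for illustration;
# they are NOT ACE07 data.

def wilcoxon_signed_rank(a, b):
    """Return W = min(sum of positive ranks, sum of negative ranks)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    # Rank by absolute difference, averaging ranks over ties.
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_pos, w_neg)

# Invented per-document value scores for two systems over 8 documents:
sys_a = [48.1, 52.3, 47.0, 55.2, 41.8, 50.6, 46.9, 53.3]
sys_b = [44.0, 49.5, 45.1, 50.0, 40.2, 47.7, 44.8, 49.9]

w = wilcoxon_signed_rank(sys_a, sys_b)
# For n=8 pairs, the two-sided 5% critical value is 3: reject equality if W <= 3.
print(w)  # every difference is positive here, so the smaller rank sum is 0.0
```

    Because every document favors the same system in this toy example, W = 0 and the difference would be called significant at the 95% level.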

    Table (1b) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.

    Table 1b - Chinese - Entities
    Site | Overall | Broadcast News | Newswire | Weblogs
    Language Computer Corporation | 45.0 | 49.7 | 46.9 | 35.0
    Fudan University | 28.8 | 35.6 | 30.2 | 18.4
    Lockheed Martin | 26.9 | 30.3 | 26.1 | 25.7
    Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds that the difference in performance between each of these systems is statistically significant at the 95% confidence level. This test was run over ALL documents.

    Table (1c) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 1c - English - Entities
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 56.3 | 44.7 | 65.4 | 58.1 | 49.2 | 39.2 | 52.7
    IBM | 52.7 | 48.7 | 65.9 | 52.8 | 45.4 | 44.0 | 45.8
    Lockheed Martin | 46.1 | 50.5 | 50.0 | 46.8 | 39.5 | 39.7 | 42.1
    Fudan University | 24.2 | 21.0 | 34.7 | 22.9 | 34.9 | 14.6 | 20.7
    # Language Computer Corporation | 20.1 | 25.2 | 47.6 | 13.0 | 8.7 | 19.7 | 16.9
    # SAIC | . | . | . | . | . | . | .
    Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds that all differences in system performance are statistically significant at the 95% confidence level, except when comparing Fudan University to Language Computer Corporation, in which case no significant difference is found. This test was run over ALL documents.

    The Wilcoxon Signed Ranks test finds that Language Computer Corporation's revised submission lies between the Fudan University and Lockheed Martin submissions: it is significantly different from both.

    Corrected Submission
    # Language Computer Corporation - revised | 35.8 | 25.2 | 47.6 | 39.3 | 8.7 | 19.7 | 27.3

    # SAIC was a first time participant. Positive value was not achieved.

    # A bug in the system caused the original submission to have invalid byte offsets, resulting in a low score. A script corrected these offsets, producing the revised submission.


    Table (1d) lists the overall value score for the Spanish evaluation test set (only the newswire domain was used).

    Table 1d - Spanish - Entities
    Site | Overall (newswire)
    IBM | 51.0

    Evaluation of Relations

    The ACE "Relation Detection and Recognition" task requires systems to identify the occurrences of a specified set of relations {Artifacts, GEN-Affiliation, Metonymy, Org-Affiliation, Part-Whole, Person-Social, Physical} in the source language documents. Complete descriptions of ACE relations and relation attributes that were to be detected can be found in the ACE evaluation specification document.

    Table (2a) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 2a - English - Relations
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 21.6 | 11.0 | 24.7 | 21.2 | 32.4 | 19.6 | 18.2

    Table (2b) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.

    Table 2b - Chinese - Relations
    Site | Overall | Broadcast News | Newswire | Weblogs
    Language Computer Corporation | 17.6 | 16.8 | 18.7 | 15.7
    # SUNY - University of Albany | . | . | . | .

    #SUNY - University of Albany was a first time participant. Positive value was not achieved.


    Evaluation of Events

    The ACE "Event Detection and Recognition" task requires systems to identify the occurrences of a specified set of events {Life, Movement, Transaction, Business, Conflict, Contact, Personnel, Justice} in the source language documents. Complete descriptions of ACE events and event attributes that are to be detected can be found in the ACE evaluation specification document.

    Table (3a) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 3a - English - Events
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 13.4 | 7.4 | 12.9 | 15.9 | 6.6 | 11.3 | 15.0


    Evaluation of Temporal Expressions (TERN)

    The ACE "TERN" task requires systems to identify the occurrences of a specified set of temporal expressions, to identify specific attributes of those expressions {Value, Modifier, Anchor value, Anchor directionality, Set}, and to normalize the expressions. Complete descriptions of ACE TERN and TERN attributes that were to be detected can be found in the ACE evaluation specification document.
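    As a hypothetical illustration of what normalization means here, the toy resolver below maps a few relative expressions to ISO-8601 values given a document date. It is not part of the ACE task or scoring machinery, and the expression set and function names are invented.

```python
# Hypothetical sketch of temporal-expression normalization: resolving
# relative expressions against a document's date. The ACE TERN task
# scores normalized values; this toy resolver merely illustrates the idea.
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def normalize(expr, doc_date):
    """Map a few relative expressions to ISO-8601 date values."""
    expr = expr.lower().strip()
    if expr == "today":
        return doc_date.isoformat()
    if expr == "yesterday":
        return (doc_date - timedelta(days=1)).isoformat()
    if expr.startswith("next ") and expr[5:] in WEEKDAYS:
        target = WEEKDAYS.index(expr[5:])
        days_ahead = (target - doc_date.weekday()) % 7 or 7
        return (doc_date + timedelta(days=days_ahead)).isoformat()
    return None  # expression not handled by this toy resolver

doc = date(2007, 5, 2)  # a Wednesday
print(normalize("yesterday", doc))     # 2007-05-01
print(normalize("next tuesday", doc))  # 2007-05-08
```

    Scoring then compares the system's normalized value against the reference value, in addition to the detected extent and attributes.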

    Table (4a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the two domains.

    Table 4a - Chinese - TERN
    Site | Overall | Newswire | Weblogs
    Lockheed Martin | 4.0 | 3.4 | 5.1
    Corrected Submission
    Lockheed Martin (revised) | 14.8 | 9.9 | 24.8

    Table (4b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 4b - English - TERN
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    Lockheed Martin | 61.6 | 44.2 | 68.4 | 67.4 | 52.6 | 63.1 | 51.4
    IBM | 59.3 | 48.2 | 68.6 | 60.9 | 60.2 | 58.2 | 52.9
    University of Amsterdam | 45.0 | 46.6 | 67.8 | 32.2 | 64.2 | 54.1 | 44.4
    Macquarie University | 24.2 | 20.9 | 43.4 | 21.9 | 38.7 | 26.6 | 11.0

    Note: The Wilcoxon Signed Ranks test comparing system performance at the document level finds no difference in system performance between Lockheed Martin and IBM, but all other comparisons found the differences to be statistically significant at the 95% confidence level. This test was run over ALL documents.

    The Wilcoxon Signed Ranks test finds that the University of Amsterdam's revised submission is not significantly different from IBM's original submission. The University of Amsterdam's revised submission was found to be significantly different from Macquarie University's revised submission.

    Corrected Submissions
    # University of Amsterdam (revised) | 58.2 | 46.6 | 67.8 | 57.3 | 64.2 | 59.0 | 54.8
    # Macquarie University (revised) | 48.3 | 30.0 | 44.4 | 54.2 | 38.7 | 55.9 | 44.8

    Table (4c) lists the overall value score for the Spanish evaluation test set (only the newswire domain was used).

    Table 4c - Spanish - TERN
    Site | Overall (newswire)
    Universidad Carlos III de Madrid | 46.5


    Evaluation Results for the ACE Mention Level Tasks

    The mention-level tasks are designed to measure a system's ability to correctly identify all mentions of the ACE entities, relations, and events. The same sets of types and attributes listed above apply to the mention-level tasks.

    The tables below list the official results of the NIST 2007 Automatic Content Extraction Evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.


    Evaluation of Entity Mentions

    Table (5a) lists the overall value score for the Arabic evaluation test set, and breaks out the value score for each of the three domains.

    Table 5a - Arabic - Entity Mentions
    Site | Overall | Broadcast News | Newswire | Weblogs
    BBN Technologies | 73.4 | 78.7 | 73.2 | 64.1
    IBM | 71.9 | 78.6 | 72.5 | 57.1
    Language Computer Corporation | 67.3 | 76.1 | 66.2 | 54.4
    Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds that each of the system differences are statistically significant at the 95% confidence level. This test was run over ALL documents.

    Table (5b) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.

    Table 5b - Chinese - Entity Mentions
    Site | Overall | Broadcast News | Newswire | Weblogs
    Language Computer Corporation | 76.7 | 80.7 | 76.8 | 72.0
    Fudan University | 59.4 | 67.2 | 59.7 | 50.5
    Lockheed Martin | 42.0 | 48.3 | 40.8 | 38.1
    Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds that each of the system differences are statistically significant at the 95% confidence level. This test was run over ALL documents.

    Table (5c) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 5c - English - Entity Mentions
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    IBM | 82.9 | 87.0 | 85.4 | 82.8 | 93.8 | 72.5 | 77.3
    BBN Technologies | 81.2 | 83.6 | 82.5 | 80.6 | 92.7 | 70.9 | 78.5
    Technical University of Catalonia | 75.0 | 80.2 | 78.7 | 73.8 | 93.1 | 60.2 | 69.2
    Lockheed Martin | 67.3 | 71.3 | 69.7 | 65.8 | 88.2 | 55.0 | 61.7
    # Language Computer Corporation | 64.4 | 83.2 | 83.3 | 51.1 | 93.7 | 71.3 | 62.4
    Fudan University | 42.3 | 37.6 | 49.9 | 42.2 | 50.3 | 28.0 | 39.5
    # SAIC | . | . | . | . | . | . | .

    Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, finds that most of the differences in system performance are statistically significant at the 95% confidence level. An exception exists with LCC: the LCC system is found to be better than the Lockheed Martin system and equivalent to the Technical University of Catalonia system. This test was run over ALL documents.

    The Wilcoxon Signed Ranks test finds that Language Computer Corporation's revised submission is not significantly different from BBN's original submission. Comparisons with each of the other systems do find a difference.

    Corrected Submission
    # Language Computer Corporation - revised | 80.9 | 83.2 | 83.3 | 80.4 | 93.7 | 71.3 | 76.6

    # SAIC was a first time participant. Positive value was not achieved.

    # A bug in the system caused the original submission to have invalid byte offsets, resulting in a low score. A script corrected these offsets, producing the revised submission.


    Table (5d) lists the overall value score for the Spanish evaluation test set (only the newswire domain was used).

    Table 5d - Spanish - Entity Mentions
    Site | Overall (newswire)
    IBM | 78.7

    Evaluation of Relation Mentions

    Table (6a) lists the overall value score for the Chinese evaluation test set, and breaks out the value score for each of the three domains.

    Table 6a - Chinese - Relation Mentions
    Site | Overall | Broadcast News | Newswire | Weblogs
    Language Computer Corporation | 29.7 | 28.8 | 31.0 | 27.3

    Table (6b) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 6b - English - Relation Mentions
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 33.4 | 24.7 | 34.0 | 33.7 | 42.6 | 31.7 | 34.8
    Technical University of Catalonia | 33.1 | 24.1 | 38.2 | 33.2 | 43.6 | 20.5 | 27.8
    # Language Computer Corporation | 32.5 | 29.3 | 35.3 | 34.6 | 38.4 | 21.1 | 23.0

    Note: The Wilcoxon Signed Ranks test, comparing system performance at the document level, does not find the differences in system performance to be statistically significant. This test was run for ALL documents.

    # The Wilcoxon Signed Ranks test results did not change when using Language Computer Corporation's revised submission.

    Corrected Submission
    # Language Computer Corporation - revised | 32.5 | 25.5 | 42.3 | 41.0 | 41.2 | 54.5 | 4.6

    # A bug in the system caused the original submission to have invalid byte offsets, resulting in a low score. A script corrected these offsets, producing the revised submission.


    Evaluation of Event Mentions

    Table (7a) lists the overall value score for the English evaluation test set, and breaks out the value score for each of the six domains.

    Table 7a - English - Event Mentions
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 24.1 | 18.2 | 25.7 | 25.4 | 22.1 | 19.1 | 24.8