NIST 2008 Automatic Content Extraction Evaluation (ACE08)
Official Results

Date of Release: September 29, 2008

Version 2


Release History

  • Version 2: First public release of official ACE08 evaluation results
  • Version 1: Initial release of official results to evaluation participants

    Introduction

    The NIST 2008 Automatic Content Extraction Evaluation (ACE08) was part of an ongoing series of evaluations dedicated to the development of technologies that automatically infer meaning from language data. NIST conducts these evaluations in order to support information extraction research and help advance the state-of-the-art in information extraction technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities.

    Disclaimer

    These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers were generally from research systems, not commercially available products. Since ACE08 was an evaluation of research algorithms, the ACE08 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for scoring. The systems themselves were not evaluated.

    The data, protocols, and metrics employed in this evaluation were chosen to support Information Extraction research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems. For that reason, this should not be considered a product testing exercise.

    Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.


    Evaluation Tasks

    The ACE08 evaluation consisted of four main tasks. The two within-document tasks required systems to process language data locally, separately for each document, and then to output, for each document, information about the entities and relations mentioned or discussed in it. The two cross-document tasks required systems to process language data globally, across multiple documents, and to output information about the entities and relations mentioned or discussed in them, reconciled across all processed documents. These tasks were offered separately for both of the ACE08 languages, English and Arabic.

    View the official ACE08 evaluation specification document for a complete listing of the evaluation protocols and the list of language and task combinations evaluated (page 5, table 7).

    Evaluation Data

    The ACE08 evaluation source data was selected from source documents originally published between 1994 and 2006. There were no constraints on the genres to be included. Documents from previous ACE corpora were excluded.

    Source Data

    There were two evaluation source sets, one for each language under test.

    The English source set contained approximately 11,000 documents from the following domains: Broadcast Conversations, Broadcast News, Meetings, Newswire, Telephone, Usenet, and Weblogs.

    The Arabic source set contained approximately 10,000 documents from the following domains: Broadcast Conversations, Broadcast News, Newswire, Telephone, Usenet, and Weblogs.

    Test Data

    Systems were required to process the large source sets completely. For performance measurement, 415 of the English and 412 of the Arabic source documents were selected out of the larger sets. The system submissions were scored on these subsets. For more details on the selection criteria for the subsets, please consult the ACE08 evaluation specification document.

    Reference Data

    The ACE test data (the 415 English and 412 Arabic document subsets of the larger source data set) was thoroughly annotated by the Linguistic Data Consortium.

    The reference data annotation used for the scores reported here is version 2.1.


    Performance Measurement

    Each ACE task is a composite task involving detection and one or more of recognition, clustering, and normalization. Multiple attributes are considered important and are individually measured; overall performance is measured using a value formula that applies weights to each attribute.

    The ACE Value score is defined to be the sum of all values of all of the system's output tokens, normalized by the sum of the values of the reference data. The possible value of a system output token depends on how closely it matches that of the reference token to which it is mapped. An ACE Value score can range from a perfect score of 100% (for perfect output) down to zero (for no output) or even down to negative scores (for systems that make costly errors).
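    The normalization described above can be illustrated with a minimal Python sketch. The per-token values here are hypothetical stand-ins; the actual attribute weights and token-mapping rules are defined in the ACE08 evaluation specification and implemented in the official scoring software.

    ```python
    # Illustrative sketch of ACE Value normalization. Token values are
    # hypothetical; the official formulas assign each mapped system token
    # a value based on how closely it matches its reference token.

    def ace_value(system_token_values, reference_token_values):
        """Sum of system output token values, normalized by the sum of
        reference token values, expressed as a percentage."""
        ref_total = sum(reference_token_values)
        sys_total = sum(system_token_values)
        return 100.0 * sys_total / ref_total

    # Perfect output: every reference token matched at full value.
    print(ace_value([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))  # 100.0

    # No output at all scores zero.
    print(ace_value([], [1.0, 1.0, 1.0]))  # 0.0

    # Costly errors (e.g. false alarms carrying negative value) can
    # push the score below zero, as in several of the results tables.
    print(ace_value([1.0, -2.5], [1.0, 1.0]))  # -75.0
    ```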

    View the appendices of the official ACE08 evaluation specification document for a complete description of the ACE scoring formulas.

    The scoring software used to calculate the ACE08 scores is ace08-eval-v17.pl.


    Evaluation Participants

    The table below lists the sites that registered and processed the evaluation test data for one or more of the four main ACE tasks for the 2008 ACE evaluation. The letters 'E' and 'A' are used to identify which language(s) were processed at each site.

    Affiliation | LEDR | LRDR | GEDR | GRDR
    Alias-i, Inc. | . | . | E | .
    ! AU-KBC Research Center | E | . | withdrawn | .
    BBN Technologies | E, A | E, A | E, A | E, A
    Fondazione Bruno Kessler | E | withdrawn | E | withdrawn
    ! France Telecom Research & Development Center Beijing | withdrawn | withdrawn | E | withdrawn
    Fudan University | E | . | . | .
    Human Language Technology Center of Excellence | . | . | E, A | E, A
    IBM | E | E | E | .
    Pontificia Universidade Catolica do Rio de Janeiro, Genesis Institute (Cortex Intelligence) | E | . | E | .
    TEMIS | A | . | . | .

    ! Designates incomplete participation (failed to attend the evaluation workshop)

    There were six sites that registered for the evaluation but did not submit any results. These are not included here.


    Evaluation Results for the Main ACE Tasks

    The tables below list the official results of the NIST 2008 Automatic Content Extraction evaluation. Scores for each site's primary system are shown, ordered by their "overall" ACE Value score. Scores shown are for the final best-effort submission, which may include bug fixes and late submissions. Submissions made on time and unmodified are marked with '#'.

    Note: There was an annotation inconsistency for some metadata in ACE08 between the designated training data and the test data. Correcting this inconsistency in several high-scoring sample systems for English indicates that an increase in ACE Value over the scores reported below could be expected from the correction. For the sample systems, the ACE Value increase was on the order of 3-5 percentage points for English Local EDR, 7-9 percentage points for English Global EDR, 1 percentage point for English Local RDR, and 2 percentage points for English Global RDR, compared to the results reported below.

    Results for ACE08 Local Entity Detection & Recognition (LEDR):
  • Table 1a - English
  • Table 1b - Arabic

    Results for ACE08 Local Relation Detection & Recognition (LRDR):
  • Table 2a - English
  • Table 2b - Arabic

    Results for ACE08 Global Entity Detection & Recognition (GEDR):
  • Table 3a - English
  • Table 3b - Arabic

    Results for ACE08 Global Relation Detection & Recognition (GRDR):
  • Table 4a - English
  • Table 4b - Arabic

    Local Entity Detection & Recognition (LEDR)

    The ACE08 Local Entity Detection and Recognition (LEDR) task requires systems to identify the occurrences of a specified set of entity types (Persons, Organizations, Locations, Facilities, and Geo-Political Entities) in the source language documents, separately for each document. Complete descriptions of ACE entities and entity attributes that are to be detected can be found in the ACE08 evaluation specification document.

    Table (1a) lists the overall ACE Value score for the English Local EDR task, ordered by descending overall ACE Value but with on-time unmodified submissions first, and breaks down the Value score by data domain:

    Table 1a - English - Local EDR
    Site | Overall | Broadcast Conversations | Broadcast News | Meetings | Newswire | Telephone | Usenet | Weblogs
    # IBM | 50.8% | 44.6% | 37.7% | -11.9% | 58.1% | 26.1% | 25.5% | 51.0%
    BBN Technologies | 52.6% | 42.0% | 36.9% | -44.2% | 61.3% | 22.1% | 31.1% | 54.8%
    Fudan University | -17.6% | -45.1% | -43.3% | -441.2% | 9.0% | -197.0% | -48.4% | 4.4%
    Pontificia Universidade Catolica do Rio de Janeiro, Genesis Institute (Cortex Intelligence) | -46.3% | -54.7% | -21.1% | -57.6% | -64.9% | -18.6% | -48.8% | 10.1%
    Fondazione Bruno Kessler | -90.0% | -148.1% | -98.8% | -404.6% | -63.8% | -436.2% | -83.0% | -43.5%
    ! AU-KBC Research Center | -269.1% | -340.0% | -279.7% | -911.4% | -188.3% | -999.9%* | -279.6% | -177.4%

    # On-time unmodified submission
    ! Site with incomplete participation (no evaluation workshop attendance)
    * -999.9 represents a cut-off for negative ACE Value.

    Table (1b) lists the overall ACE Value score for the Arabic Local EDR task, ordered by descending overall ACE Value, and breaks down the Value score by data domain:

    Table 1b - Arabic - Local EDR
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 43.6% | 33.2% | 38.4% | 49.8% | -64.1% | 24.4% | 23.9%
    TEMIS | -9.1% | -31.7% | 2.6% | -5.4% | -31.5% | -8.2% | -13.4%

    Local Relation Detection & Recognition (LRDR)

    The ACE08 Local Relation Detection and Recognition (LRDR) task requires systems to identify the occurrences of a specified set of relations (Agent-Artifact, General-Affiliation, Metonymy, Organization-Affiliation, Part-Whole, Person-Social, Physical) in the source language documents, separately for each document. Complete descriptions of ACE relations and relation attributes that are to be detected can be found in the ACE08 evaluation specification document.

    Table (2a) lists the overall ACE Value score for the English Local RDR task, ordered by descending overall ACE Value but with on-time unmodified submissions first, and breaks down the Value score by data domain:

    Table 2a - English - Local RDR
    Site | Overall | Broadcast Conversations | Broadcast News | Meetings | Newswire | Telephone | Usenet | Weblogs
    # IBM | 3.8% | -7.5% | 6.1% | -11.1% | 5.5% | -16.7% | -4.9% | 2.6%
    BBN Technologies | 11.6% | 10.8% | 11.0% | -17.5% | 13.7% | -4.7% | -0.7% | 8.7%

    # On-time unmodified submission


    Table (2b) lists the overall ACE Value score for the Arabic Local RDR task, and breaks down the Value score by data domain:

    Table 2b - Arabic - Local RDR
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs
    BBN Technologies | 5.5% | -1.1% | 8.2% | 6.4% | -22.3% | -3.5% | 3.1%

    Global Entity Detection & Recognition (GEDR)

    The ACE08 Global Entity Detection and Recognition (GEDR) task requires systems to identify the occurrences of a specified limited set of entity types (Persons and Organizations) in the source documents and output them globally reconciled across all processed documents. Complete descriptions of ACE entities and entity attributes that are to be detected can be found in the ACE08 evaluation specification document.

    Table (3a) lists the overall ACE Value score for the English Global EDR task, ordered by descending overall ACE Value but with on-time unmodified submissions first, and breaks down the Value score by data domain:

    Table 3a - English - Global EDR
    Site | Overall | Broadcast Conversations | Broadcast News | Meetings | Newswire | Telephone | Usenet | Weblogs | Multi-domain
    # IBM | 51.0% | 46.0% | 21.2% | -60.7% | 57.4% | -20.2% | 17.7% | 33.5% | 63.4%
    # Alias-i, Inc. | 1.2% | -8.2% | -35.7% | -133.6% | 2.3% | -60.1% | -101.3% | -9.2% | 24.6%
    BBN Technologies | 53.0% | -6.0% | 9.5% | -154.3% | 65.8% | -20.3% | 9.7% | 31.7% | 66.4%
    Human Language Technology Center of Excellence | 42.0% | 7.1% | 1.6% | -152.8% | 57.7% | -16.6% | 10.1% | 26.5% | 46.6%
    Fondazione Bruno Kessler | 34.2% | 21.4% | 4.7% | -93.8% | 40.1% | -28.9% | 1.6% | 14.6% | 48.0%
    ! France Telecom Research & Development Center Beijing | 30.6% | 37.9% | 15.0% | -89.9% | 34.7% | -1.0% | -11.8% | 19.1% | 39.6%
    Pontificia Universidade Catolica do Rio de Janeiro, Genesis Institute (Cortex Intelligence) | -64.1% | -24.7% | -72.4% | -151.6% | -22.7% | -7.7% | -127.3% | -56.8% | -118.9%

    # On-time unmodified submission
    ! Site with incomplete participation (no evaluation workshop attendance)

    Table (3b) lists the overall ACE Value score for the Arabic Global EDR task, ordered by descending overall ACE Value, and breaks down the Value score by data domain:

    Table 3b - Arabic - Global EDR
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs | Multi-domain
    BBN Technologies | 28.2% | 13.7% | 12.6% | 38.9% | -167.1% | 4.8% | 27.9% | 19.2%
    Human Language Technology Center of Excellence | 21.2% | 3.6% | -3.7% | 34.9% | -176.7% | 7.2% | 26.8% | 3.9%

    Global Relation Detection & Recognition (GRDR)

    The ACE08 Global Relation Detection and Recognition (GRDR) task requires systems to identify the occurrences of a specified set of relations (Agent-Artifact, General-Affiliation, Metonymy, Organization-Affiliation, Part-Whole, Person-Social, Physical) in the source language documents, and to output them globally reconciled across all processed documents. Complete descriptions of ACE relations and relation attributes that are to be detected can be found in the ACE08 evaluation specification document.

    Table (4a) lists the overall ACE Value score for the English Global RDR task, ordered by descending overall ACE Value, and breaks down the Value score by data domain:

    Table 4a - English - Global RDR
    Site | Overall | Broadcast Conversations | Broadcast News | Meetings | Newswire | Telephone | Usenet | Weblogs | Multi-domain
    BBN Technologies | 10.1% | 11.0% | 29.1% | 0.0% | 7.1% | -8.3% | -8.2% | -2.8% | 24.7%
    Human Language Technology Center of Excellence | -16.1% | -30.0% | -31.7% | 0.0% | -13.3% | -8.3% | -13.3% | -22.4% | -12.0%

    Table (4b) lists the overall ACE Value score for the Arabic Global RDR task, ordered by descending overall ACE Value, and breaks down the Value score by data domain:

    Table 4b - Arabic - Global RDR
    Site | Overall | Broadcast Conversations | Broadcast News | Newswire | Telephone | Usenet | Weblogs | Multi-domain
    BBN Technologies | -7.4% | -18.8% | -3.7% | -6.1% | -75.0% | -22.5% | -5.8% | -18.3%
    Human Language Technology Center of Excellence | -9.7% | -18.8% | -9.5% | -8.5% | -75.0% | -22.5% | -5.8% | -18.3%
