NIST 2005 Automatic Content Extraction Evaluation Official Results
(ACE05)

Date of Release: Tue, Jan. 10th, 2006
Version 2


Release History


Introduction

The NIST 2005 Automatic Content Extraction Evaluation (ACE05) was part of an ongoing series of evaluations dedicated to the development of technologies that automatically infer meaning from language data. NIST conducts these evaluations in order to support information extraction research and help advance the state-of-the-art in information extraction technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers were generally from research systems, not commercially available products. Since ACE05 was an evaluation of research algorithms, the ACE05 test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for scoring. The systems themselves were not evaluated.

The data, protocols, and metrics employed in this evaluation were chosen to support Information Extraction research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems. For that reason, this should not be considered a product testing exercise.


Evaluation Tasks

The ACE05 evaluation consisted of five main tasks, three mention level tasks, and three diagnostic tasks. These tasks require systems to process language data in documents and then to output, for each of these documents, information about the entities, values, temporal expressions, relations, and events mentioned or discussed in them. These tasks are evaluated separately for one or more of the ACE languages (Arabic, Chinese, and/or English).

View the official ACE evaluation specification document for a complete listing of the evaluation protocols and the list of language and task combinations evaluated (page 4, table 8).


Evaluation Data

Source Data

There were three evaluation source sets, one for each language under test. The English source set contained approximately 50,000 words from the following domains: broadcast conversations, broadcast news, newswire, telephone conversations, Usenet newsgroups, and weblogs.

Both the Arabic and Chinese source sets contained approximately 50,000 words (for Chinese, 1.5 characters were counted as one word, i.e., roughly 75,000 characters) from the following domains: broadcast news, newswire, and weblogs.

View the official ACE evaluation specification document for a complete listing of the sources and time epochs of the evaluation data (page 6, table 10).

Reference Data

Both the Chinese and English evaluation test sets were fully annotated by the Linguistic Data Consortium. The Arabic evaluation set was annotated by an outside source, and a shortage of funding resulted in less than 33% of the evaluation data being properly annotated. For this reason, Arabic results will not be posted, since the lack of reference data reduces the statistical power of the results.


Performance Measurement

Each ACE task is a composite task involving detection and one or more of recognition, clustering, and normalization. Multiple attributes are considered important; each is measured individually, and overall performance is measured using a value formula that applies weights to each attribute.

The value score is defined as the sum of the values of all of the system's output tokens, normalized by the sum of the values of the reference tokens. The possible value of a system output token depends on how closely it matches the reference token to which it is mapped. A value score can range from a negative score up to 100%.
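In rough terms, setting aside the per-attribute weights described in the specification's appendices, the normalization can be sketched as

\[
\text{Value} = 100\% \times \frac{\sum_{t \in \text{system output}} v(t)}{\sum_{r \in \text{reference}} v(r)}
\]

where v(t) is the weighted value credited to an output token and v(r) is the value of a reference token. Since poorly matched or spurious output tokens can be credited with negative value, the overall score can fall below zero.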

View the appendices of the official ACE evaluation specification document for a complete description of the ACE scoring formulas.


Evaluation Participants

The table below lists the sites that participated in one or more of the five main ACE tasks for the 2005 Automatic Content Extraction evaluation. The letters 'A', 'C', and 'E' are used to identify which language(s) were processed by each site's system:

Site                                  entities  relations  events  time  values
University of Amsterdam               E         -          E       E     -
BBN Technologies                      ACE       ACE        CE      -     CE
#Basis Technology, Inc.               ACE       -          -       -     -
University of Colorado                CE        CE         -       -     -
#Harbin Institute of Technology       CE        -          -       -     -
IBM                                   ACE       ACE        E       -     -
Janya Inc.                            -         -          -       E     -
Language Computer Corporation         -         -          -       E     -
Lockheed Martin                       E         -          E       E     E
New York University                   C         -          -       -     -
Peking University                     -         -          -       C     C
#Polytechnic University of Hong Kong  CE        -          -       C     -
SRA team #1                           E         E          -       -     -
SRA team #2                           E         E          -       -     -
#XIAMEN University                    C         -          -       -     -

# Sites that did not fulfill the requirement of attending the follow-up workshop.


Evaluation Results for the Five Main ACE Tasks

The tables below list the official results of the NIST 2005 Automatic Content Extraction evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.

Evaluation of Entities

The ACE "Entity Detection and Recognition" task requires systems to identify the occurrences of a specified set of entities {Persons, Organizations, Locations, Facilities, Geo-Politicals, Weapons, Vehicles} in the source language documents. Complete descriptions of ACE entities and entity attributes that are to be detected can be found in the ACE evaluation specification document .

Table 1a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 1a - Chinese - Entities
Site                                  Overall  Broadcast News  Newswire  Weblogs
IBM                                    69.2     70.5            69.6      65.0
BBN Technologies                       68.8     67.9            70.1      67.1
New York University                    65.7     64.3            69.9      65.7
University of Colorado                 61.1     64.9            57.4      63.1
Polytechnic University of Hong Kong    49.4     51.3            50.2      42.4
XIAMEN University                      47.6     44.8            51.0      44.0
Harbin Institute of Technology         43.8     44.1            48.0      30.1
Basis Technology, Inc.                  3.8      3.0             4.7       2.8
Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, no difference in system performance was found between IBM and BBN Technologies at the 5% significance level.
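As an aside, a minimal sketch of how such a paired, per-document comparison might be run; the scores below are invented for illustration, and scipy's wilcoxon is assumed as the test implementation:

    from scipy.stats import wilcoxon  # paired, non-parametric signed-ranks test

    # Invented per-document value scores for two systems on the same documents.
    scores_system_a = [69.1, 72.4, 65.0, 70.8, 68.2, 71.5]
    scores_system_b = [68.5, 71.9, 66.2, 69.7, 67.4, 70.9]

    # The test asks whether the median per-document score difference is zero.
    statistic, p_value = wilcoxon(scores_system_a, scores_system_b)
    if p_value < 0.05:
        print(f"significant difference at the 5% level (p = {p_value:.3f})")
    else:
        print(f"no significant difference at the 5% level (p = {p_value:.3f})")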

Table 1b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 1b - English - Entities
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
SRA team #1                            71.9     72.7                     77.1            72.8      62.9       61.5               67.6
BBN Technologies                       71.7     71.8                     75.4            72.2      67.7       59.7               71.6
SRA team #2                            71.3     67.2                     77.3            73.1      59.3       60.7               69.0
IBM                                    69.6     61.7                     76.2            72.0      57.8       60.5               66.3
University of Colorado                 68.5     67.2                     73.7            72.7      65.1       50.2               61.6
Lockheed Martin                        57.4     58.7                     63.3            60.0      48.5       41.7               50.6
University of Amsterdam                27.3     26.8                     31.5            27.6      22.6       20.3               24.4
Polytechnic University of Hong Kong    20.8     20.3                     31.7            25.1       6.6      -10.8               13.1
Harbin Institute of Technology         15.2      9.1                     26.9            15.3      10.8      -12.9               14.7
Basis Technology, Inc.                  4.0      1.4                      9.6             3.6      -3.8       -3.3                2.2
Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, no difference in system performance was found between SRA team #1, BBN Technologies, SRA team #2, and IBM at the 5% significance level.


Evaluation of Relations

The ACE "Relation Detection and Recognition" task requires systems to identify the occurrences of a specified set of relations {Artifacts, Gen-Affiliation, Metonymy, Org-Affiliation, Part-Whole, Person-Social, Physical} in the source language documents. Complete descriptions of ACE relations and relation attributes that are to be detected can be found in the ACE evaluation specification document .

Table 2a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 2a - Chinese - Relations
Site                                  Overall  Broadcast News  Newswire  Weblogs
IBM                                    26.8     24.4            28.6      26.6
BBN Technologies                       22.7     20.6            24.6      21.8
University of Colorado                 21.0     22.4            20.4      19.4
Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, the difference between IBM's system and the other systems was found to be significant at the 5% level.

Table 2b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 2b - English - Relations
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
SRA team #2                            25.2     12.3                     32.6            27.3      16.7       16.9               20.5
BBN Technologies                       25.1     16.0                     28.3            26.2      35.6       19.6               17.0
IBM                                    23.8      8.8                     33.1            25.1      19.1       13.9               15.9
SRA team #1                            23.4     11.8                     29.4            24.7      21.3       17.3               18.2
University of Colorado                 20.1      8.7                     26.1            22.7      20.1        0.6               17.8
Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, no difference in system performance was found between SRA team #2, BBN Technologies, and IBM at the 5% significance level.


Evaluation of Events

The ACE "Event Detection and Recognition" task requires systems to identify the occurrences of a specified set of events {Life, Movement, Transaction, Business, Conflict, Contact, Personnel, Justice} in the source language documents. Complete descriptions of ACE events and event attributes that are to be detected can be found in the ACE evaluation specification document .

Table 3a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 3a - Chinese - Events
Site                                  Overall  Broadcast News  Newswire  Weblogs
BBN Technologies                       10.2     11.2            10.8       2.0

Table 3b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 3b - English - Events
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       14.4      6.2                     12.3            17.8      15.2       13.1               15.4
IBM                                     6.7     -3.5                      8.3            10.4      -5.5        3.5                3.2
Lockheed Martin                         3.5      4.0                      4.4             5.6      -2.4       -3.9                1.0
University of Amsterdam                -8.6    -13.4                     -8.4            -6.9     -11.9       -8.8              -10.4
Note: Using the Wilcoxon Signed Ranks test to compare system performance at the document level, the difference between BBN's system and the other systems was found to be significant at the 5% level.


Evaluation of Values

The ACE "Value detection" task requires systems to identify the occurrences of a specified set of values {Contact-Info, Numeric, and when part of an event: Crime, Job-title, Sentence} in the source language documents. Complete descriptions of ACE values and value attributes that are to be detected can be found in the ACE evaluation specification document .

Table 4a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 4a - Chinese - Values
Site                                  Overall  Broadcast News  Newswire  Weblogs
Peking University                      49.7     48.7            42.1      71.2
BBN Technologies                       45.7     47.4            38.3      63.8

Table 4b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 4b - English - Values
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       34.8     26.6                     12.6            43.4      25.0       48.7               38.5
Lockheed Martin                        25.5     30.0                      0.8            26.4      25.0       49.7               32.6


Evaluation of Temporal Expressions

The ACE "TERN" task requires systems to identify the occurrences of a specified set of temporal expressions and specific attributes about the expressions {Value, Modifier, Anchor value, Anchor directionality, Set} and for the English data to normalize the expressions. Complete descriptions of ACE TERN and TERN attributes that are to be detected can be found in the ACE evaluation specification document .

Table 5a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 5a - Chinese - Temporal Expressions
Site                                  Overall  Broadcast News  Newswire  Weblogs
Polytechnic University of Hong Kong    83.7     81.8            84.3      86.2
Peking University                      79.0     75.0            82.9      78.2
Note: The evaluation of temporal expressions for the Chinese data did not involve normalization, making it a simpler task than the one conducted on the English data.

Table 5b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 5b - English - Temporal Expressions
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
Language Computer Corp.                63.7     48.0                     65.6            72.6      56.2       63.4               61.7
Lockheed Martin                        56.2     39.8                     62.6            53.1      58.6       55.8               62.8
Janya Inc.                             54.8     40.6                     59.8            62.7      37.3       52.0               53.6
University of Amsterdam                33.2     23.8                     32.4            39.4      32.1       24.8               42.1



Section for the ACE05 Mention Level Tasks



Evaluation Participants in the Mention Level Tasks

The mention level tasks are designed to measure a system's ability to correctly identify all mentions of the ACE entities, relations, and events. The same set of types and attributes listed above applies to the mention level tasks.

The table below lists the sites that participated in one or more of the three ACE mention level tasks for this year's automatic content extraction evaluation. The letters 'A', 'C', and 'E' are used to identify which language(s) were processed for each task:

Site                                  entity mentions  relation mentions  event mentions
BBN Technologies                      ACE              ACE                -
#Basis Technology, Inc.               ACE              -                  -
#Chinese Academy of Sciences          C                -                  -
University of Colorado                CE               CE                 -
#Harbin Institute of Technology       CE               -                  -
Lockheed Martin                       E                -                  E
New York University                   C                -                  -
Peking University                     C                -                  -
#Polytechnic University of Hong Kong  CE               C                  -
SRA team #1                           E                -                  -
SRA team #2                           E                -                  -
University of Amsterdam               E                -                  E
#XIAMEN University                    C                -                  -

# Sites that did not fulfill the requirement of attending the follow-up workshop.


Evaluation Results for the ACE Mention Level Tasks

The tables below list the official results of the NIST 2005 Automatic Content Extraction Evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.

Evaluation of Entity Mentions

Table 6a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 6a - Chinese - Entity Mentions
Site                                  Overall  Broadcast News  Newswire  Weblogs
New York University                    79.1     78.4            78.3      82.9
BBN Technologies                       78.8     79.2            78.9      77.9
University of Colorado                 73.0     79.5            68.1      71.9
Harbin Institute of Technology         62.8     64.2            65.7      51.8
Peking University                      62.2     61.2            62.9      62.3
XIAMEN University                      61.3     58.5            64.1      60.3
Chinese Academy of Sciences            51.4     50.8            50.3      55.8
Basis Technology, Inc.                 46.6     44.9            48.5      45.2
Polytechnic University of Hong Kong    43.4     45.6            42.3      41.4

Table 6b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 6b - English - Entity Mentions
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       85.1     86.8                     84.2            84.6      93.7       74.0               85.0
SRA team #2                            84.7     83.3                     84.5            84.8      93.8       75.8               82.4
SRA team #1                            83.7     84.0                     84.4            84.0      89.9       74.6               81.5
University of Colorado                 82.9     85.1                     82.4            85.2      91.3       66.3               79.7
Lockheed Martin                        68.3     69.1                     73.6            73.2      59.5       55.6               66.2
University of Amsterdam                45.6     44.3                     38.8            42.2      74.1       36.3               41.7
Basis Technology, Inc.                 30.5     30.4                     31.7            41.9       1.2       29.8               37.7
Polytechnic University of Hong Kong    29.4     27.6                     27.9            23.3      56.8       12.8               28.8
Harbin Institute of Technology         19.1     17.5                     29.9            22.2      11.8       -8.5               24.0


Evaluation of Relation Mentions

Table 7a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 7a - Chinese - Relation Mentions
Site                                  Overall  Broadcast News  Newswire  Weblogs
BBN Technologies                       31.9     29.5            33.7      32.2
University of Colorado                 30.2     31.7            28.3      33.2
Peking University                      15.0     14.6            16.5       9.9

Table 7b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 7b - English - Relation Mentions
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       36.8     27.3                     37.1            39.1      44.4       30.3               38.8
University of Colorado                 34.6     26.7                     38.1            36.1      41.3       20.6               32.9


Evaluation of Event Mentions

Table 8a - Chinese - Event Mentions
Note: There were no entries for the "Chinese Event Mention Detection" task, so no results are reported.

Table 8b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 8b - English - Event Mentions
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
University of Amsterdam                 7.3     13.9                      4.4             7.0      17.9        6.2                8.9
Lockheed Martin                         2.8      0.0                      4.6             4.3     -17.4        0.8                0.2




Section for the ACE05 Diagnostic Tasks



Evaluation Participants in the Diagnostic Tasks

The diagnostic tasks are offered to assist researchers. They are designed to measure a system's performance on various ACE tasks when the system is given partial ground-truth information (commonly referred to as "cheating experiments").

The table below lists the sites that participated in one or more of the three ACE diagnostic tasks for this year's automatic content extraction evaluation. The letters 'C' and 'E' are used to identify which language(s) were processed for each task:

Site                                  entities given mentions  relations given entities  events given entities
University of Amsterdam               -                        -                         E
BBN Technologies                      CE                       CE                        CE
University of Colorado                CE                       CE                        -
Language Computer Corporation         E                        -                         -
New York University                   E                        -                         E


Evaluation Results for the ACE Diagnostic Tasks

The tables below list the official results of the NIST 2005 Automatic Content Extraction Evaluation. Scores for each site's primary system are shown. Systems are ordered by their "overall" value score.

Evaluation of Entities (EDR Co-Reference) Given Entity Mentions

Table 9a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 9a - Chinese - Entities Given Mentions
Site                                  Overall  Broadcast News  Newswire  Weblogs
University of Colorado                 90.4     90.9            90.1      90.4
BBN Technologies                       89.9     90.8            89.4      89.4

Table 9b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 9b - English - Entities Given Mentions
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       88.9     86.4                     90.4            89.1      82.5       89.4               89.5
University of Colorado                 88.0     85.3                     90.4            89.4      80.1       86.0               86.0
New York University                    87.0     84.1                     88.7            88.4      75.2       84.8               88.3
Language Computer Corporation          83.1     78.1                     84.2            84.0      76.7       83.6               84.7


Evaluation of Relations (RDR) Given Correct Entities, Values, and TIMEX2s

Table 10a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 10a - Chinese - Relations Given Entities, Values, and TIMEX2s
Site                                  Overall  Broadcast News  Newswire  Weblogs
University of Colorado                 56.0     52.6            58.0      57.8
BBN Technologies                       53.1     50.6            54.0      57.0

Table 10b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 10b - English - Relations Given Entities, Values, and TIMEX2s
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       54.4     47.4                     58.1            56.2      50.0       50.9               49.7
University of Colorado                 50.8     44.4                     55.6            50.0      57.5       42.4               46.0


Evaluation of Events (VDR) Given Correct Entities, Values, and TIMEX2s

Table 11a lists the overall value score for the Chinese evaluation test set and breaks out the value score for each of the three domains.

Table 11a - Chinese - Events Given Entities, Values, and TIMEX2s
Site                                  Overall  Broadcast News  Newswire  Weblogs
BBN Technologies                       25.0     30.7            23.7       9.1

Table 11b lists the overall value score for the English evaluation test set and breaks out the value score for each of the six domains.

Table 11b - English - Events Given Entities, Values, and TIMEX2s
Site                                  Overall  Broadcast Conversations  Broadcast News  Newswire  Telephone  Usenet Newsgroups  Weblogs
BBN Technologies                       32.7     37.0                     28.2            34.9      28.6       32.4               35.4
New York University                    29.7     34.2                     26.3            32.4      31.4       24.3               30.4
University of Amsterdam                19.7     20.6                     18.9            21.0      12.0       13.4               23.8