OpenHaRT 2010 Evaluation Results
Release Date: October 15th, 2010 - 14:37 EDT
The NIST Open Handwriting Recognition and Translation Evaluation (OpenHaRT) is an evaluation of image-to-text transcription and translation technologies and is open to all who find the tasks of interest. The 2010 evaluation was the first evaluation of this series and was conducted in accordance with the protocol described in the 2010 OpenHaRT evaluation plan.
Disclaimer
These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers were generally from research systems, not commercially available products. Since OpenHaRT was an evaluation of research algorithms, the test design required local implementation by each participant. As such, participants were only required to submit their system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.
The data, protocols, and metrics employed in this evaluation were chosen to support research and should not be construed as indicating how well these systems would perform in applications. While changes in the data domain, or changes in the amount of data used to build a system, can greatly influence system performance, changing the task protocols could indicate different performance strengths and weaknesses for these same systems.
The 2010 OpenHaRT evaluation was the first in what we envision to be a long series of document understanding technology evaluations. Developing a strong evaluation series will require us to learn from our evaluation methods. In 2010, some of the evaluation protocols were suggested to be too restrictive for first-time participation, requiring several participants to forgo submissions in many conditions. As we address these concerns, we expect the evaluation series to become more informative to both NIST and the participants. The 2010 OpenHaRT evaluation was a great learning experience for all involved and we look forward to building on our findings.
For the reasons above, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.
Results Release History
- October 15, 2010: First public release
- October 1, 2010: Included UOB results
- August 27, 2010:
  - Included results of submissions by A2iA and tubitak that were on-time but were not scored
  - Separated results into official and pilot categories
  - Corrected a bug in the translation normalization process and rescored submissions for the translation tasks
- August 10, 2010: Preliminary results released to participants
Evaluation Tasks
The three evaluation tasks measure different aspects of the overall system:
- Document Image Translation (DIT) - measures the overall performance of the system in translating the document image into accurate and fluent English text
- Document Text Translation (DTT) - measures the translation component of the system given a manually produced transcription of the text in the document image. This is the contrastive task to the DIT task.
- Document Image Recognition (DIR) - measures the optical character recognition component of the system in transcribing the text in the document image
Segmentation Conditions
The two segmentation conditions explore the relationship between the system's performance and the system's ability to segment the data:
- Word segmentation - the system was given the word boundaries that were marked by human annotators
- Line segmentation - the system was given the line polygons that were derived from the word annotations
Performance Measurements
System performance on the translation tasks (DIT, DTT) is measured using the following metrics:
- TER (v0.7.25) - the primary translation metric. It is an error percentage metric. The results are presented as 100-TER.
- METEOR (v0.7) - a contrastive translation metric. It is an accuracy metric with values between 0 and 1; higher values indicate better performance.
- BLEU (v1.04) - another contrastive translation metric. It is also an accuracy metric with values between 0 and 1; higher values indicate better performance.
System performance on the transcription task (DIR) is measured using WER. WER is an error percentage metric. The results are presented as 100-WER. Punctuation marks are filtered but are scored as is.
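To make the 100-WER presentation concrete, the following is a minimal sketch of a word error rate computation and its conversion to the reported form. It assumes plain whitespace tokenization and the standard word-level edit distance (substitutions, insertions, deletions divided by the reference length); the function names are hypothetical, and this is not the NIST scoring pipeline, whose normalization and punctuation handling are not reproduced here.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_percent(reference: str, hypothesis: str) -> float:
    """WER as a percentage of the reference length (assumed tokenization: whitespace)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(ref, hyp) / len(ref)


def reported_score(error_percent: float) -> float:
    """Convert an error percentage into the 100-minus-error form used in the tables."""
    return 100.0 - error_percent


if __name__ == "__main__":
    wer = wer_percent("a b c d", "a x c")
    print(wer, reported_score(wer))
```

Because insertions count as errors, WER can exceed 100, so a 100-WER value is not bounded below by zero. The same `reported_score` conversion applies to TER.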
Participants
The table below lists the organizations and the tasks for which they registered to participate in the evaluation. The submissions have the following descriptors:
- official - site signed up for the task and submitted the results on-time
- pilot - site signed up for the task and submitted the results after the deadline
- withdrawn (no submission) - site signed up for the task and withdrew from participation
| Site ID | Organization | Location | DIT with word segmentation | DIT with line segmentation | DIR with word segmentation | DIR with line segmentation | DTT |
| A2iA | A2iA | France | - | - | official | - | - |
| APPTEK | Applications Technology, Inc. | USA | withdrawn | withdrawn | - | - | official |
| IfNREGIM | Institute for Communications Technology, Braunschweig Technical University | Germany | - | - | pilot | withdrawn | - |
| UPV-PRHLT | Pattern Recognition and Human Language Technology Group, Universitat Politècnica de València | Spain | withdrawn | withdrawn | official | official | withdrawn |
| tubitak | The Scientific and Technological Research Council of Turkey | Turkey | pilot | withdrawn | pilot | withdrawn | pilot |
| uob | University of Balamand | Lebanon | pilot | withdrawn | pilot | withdrawn | withdrawn |
Evaluation Results
The tables and graphs below give the overall results for each task and segmentation condition over the entire data set. The official results section contains the results for on-time submissions while the pilot results section contains the results for late submissions. Cross-site comparisons are limited to the primary systems. The contrastive systems are only compared against the primary system from the same site for the same task and segmentation condition.
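The comparison rule above can be sketched in a few lines. The sketch assumes that each system ID has the form `SITE.SYSTEM.RUN` (for example `tubitak.c1.1`), which is inferred from the IDs in the tables rather than stated in the evaluation plan; the function names are hypothetical. The sample scores are the 100-TER values from the official Document Text Translation tables.

```python
from typing import Dict, List, Tuple

# 100-TER values copied from the official Document Text Translation tables.
DTT_SCORES: Dict[str, float] = {
    "APPTEK.primary.1": 43.7502,
    "APPTEK.c1.1": 43.7392,
    "tubitak.c2.1": 42.4548,
    "tubitak.c1.1": 42.4294,
    "tubitak.primary.1": 36.2983,
}


def parse_system_id(system_id: str) -> Tuple[str, str, str]:
    """Split an ID into (site, system, run), assuming the SITE.SYSTEM.RUN form."""
    # Splitting from the right keeps site names such as 'UPV-PRHLT' intact.
    site, system, run = system_id.rsplit(".", 2)
    return site, system, run


def cross_site_ranking(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank primary systems only, highest score first."""
    primaries = [(sid, score) for sid, score in scores.items()
                 if parse_system_id(sid)[1] == "primary"]
    return sorted(primaries, key=lambda item: item[1], reverse=True)


def within_site_ranking(scores: Dict[str, float], site: str) -> List[Tuple[str, float]]:
    """Rank all systems (primary and contrastive) submitted by one site."""
    same_site = [(sid, score) for sid, score in scores.items()
                 if parse_system_id(sid)[0] == site]
    return sorted(same_site, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(cross_site_ranking(DTT_SCORES))
    print(within_site_ranking(DTT_SCORES, "tubitak"))
```

Keeping the two rankings in separate functions mirrors the layout of the results: one cross-site table per task and condition, followed by one table per site.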
Official Results
Results for Document Text Translation Task
| ID | 100-TER | METEOR | BLEU |
| APPTEK.primary.1 | 43.7502 | 0.6079 | 0.2485 |
| tubitak.primary.1 | 36.2983 | 0.5760 | 0.2372 |
Results for APPTEK Document Text Translation Task
| ID | 100-TER | METEOR | BLEU |
| APPTEK.primary.1 | 43.7502 | 0.6079 | 0.2485 |
| APPTEK.c1.1 | 43.7392 | 0.6091 | 0.2444 |
Results for tubitak Document Text Translation Task
| ID | 100-TER | METEOR | BLEU |
| tubitak.c2.1 | 42.4548 | 0.5723 | 0.2528 |
| tubitak.c1.1 | 42.4294 | 0.5733 | 0.2543 |
| tubitak.primary.1 | 36.2983 | 0.5760 | 0.2372 |
Results for Document Image Recognition Task with Word Segmentation
| ID | 100-WER |
| A2iA.primary.1 | 62.3053 |
| UPV-PRHLT.primary.1 | 48.5132 |
Results for A2iA Document Image Recognition Task with Word Segmentation
| ID | 100-WER |
| A2iA.primary.1 | 62.3053 |
| A2iA.c4.1 | 62.2244 |
| A2iA.c2.1 | 61.3029 |
| A2iA.c1.1 | 54.0070 |
| A2iA.c3.1 | 53.8251 |
| A2iA.c0.1 | 44.9447 |
Results for UPV-PRHLT Document Image Recognition Task with Word Segmentation
| ID | 100-WER |
| UPV-PRHLT.c1.1 | 51.0620 |
| UPV-PRHLT.primary.1 | 48.5132 |
Results for Document Image Recognition Task with Line Segmentation
| ID | 100-WER |
| UPV-PRHLT.primary.1 | 52.5418 |
Results for UPV-PRHLT Document Image Recognition Task with Line Segmentation
| ID | 100-WER |
| UPV-PRHLT.c1.1 | 52.5418 |
| UPV-PRHLT.primary.1 | 52.5418 |
Pilot Results
Results for Document Image Translation Task with Word Segmentation
| ID | 100-TER | METEOR | BLEU |
| tubitak.primary.1 | 15.8386 | 0.2629 | 0.0498 |
| uob.primary.1 | 7.4699 | 0.1637 | 0.0181 |
Results for tubitak Document Image Translation Task with Word Segmentation
| ID | 100-TER | METEOR | BLEU |
| tubitak.primary.1 | 15.8386 | 0.2629 | 0.0498 |
Results for Document Image Recognition Task with Word Segmentation
| ID | 100-WER |
| tubitak.primary.1 | 29.0510 |
| uob.primary.1 | 28.6038 |
| IfNREGIM.primary.2 | 0.4448 |
Results for IfNREGIM Document Image Recognition Task with Word Segmentation
| ID | 100-WER |
| IfNREGIM.primary.2 | 0.4448 |
Results for tubitak Document Image Recognition Task with Word Segmentation
| ID | 100-WER |
| tubitak.primary.1 | 29.0510 |