2009 TRECVid Event Detection Evaluation Plan
This document presents the
evaluation plan for event detection in surveillance video for TRECVid 2009. The goal of the evaluation will be to build
and evaluate systems that can detect instances of a variety of observable
events in the airport surveillance domain. The video source data to be
used is a ~100-hour corpus of video surveillance data collected by the UK Home
Office at the London Gatwick International Airport.
Two event detection tasks will be
supported: a retrospective event detection task run with complete reference
annotations, and a “freestyle” experimental analysis track to permit
participants to explore their own ideas with regard to the airport surveillance
domain.
The following topics are discussed below:
· Video source data
· Evaluation tasks
· Evaluation measures
· Evaluation infrastructure
· Event definitions
· Schedule
The source data will consist of at
least 100 hours (10 days * 2 hours/day * 5 cameras), obtained from Gatwick
Airport surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium will provide
event annotations for the entire corpus.
The corpus will be divided into development
and evaluation subsets. The exact partitioning of the data is to be determined.
Developers may use the devset in any manner to build their systems, including
activities such as dividing it into internal test sets, jackknifed training,
etc. During the summer months, NIST will
conduct a dry run evaluation using the devset as the
video source. While testing on the
development data is a non-blind test, the purpose of the dry run (to test the
evaluation infrastructure) is most easily accomplished using the devset.
We will release the full corpus (devset + evalset) early in the
evaluation cycle to give people the opportunity to preprocess the full corpus
throughout the year. The evaluation set must not be inspected or mined for
information until after the evalset annotations are
released. The evalset restriction applies to both
evaluation tasks. However, participants can run feature extraction programs on
the evalset to prepare for the formal evaluation.
Allowable side information (i.e.,
“contextual” information) will include resources posted on the TRECVid Event Detection website as well as any annotations
constructed by developers based on the devset.
Participants may share devset annotations. No annotation
of the evalset is permitted prior to the evaluation
submission deadline.
This proposal includes the
following evaluation tasks:
Systems will be evaluated on how well they can detect
event occurrences in the evaluation corpus.
The determination of correct detection will be based solely on the
temporal similarity between the annotated reference event observations and the
system-detected event observations.
System detection performance is measured as a
tradeoff between two error types: missed detections (MD) and false alarms
(FA). The two error types will be
combined into a single error measure using the Detection Cost Rate (DCR) model,
which is a linear combination of the two errors. The DCR model distills the needs of a
hypothetical application into a set of predefined constant parameters that
include the event priors and weights for each error type. While the chosen constants have been
motivated by discussions with the research and user communities, the single
operation point characterized by the DCR model is a small window into the
performance of an event detection system.
In addition to DCR measures, Detection Error Tradeoff (DET) curves will
be produced to graphically depict the tradeoff of the two error types over a
wide range of operational points. The DCR model and the DET curve are related:
the DCR model defines an optimal point along the DET curve.
The rest of this section defines the system output,
followed by the three steps of the evaluation process: temporal alignment,
Detection Error Tradeoff (DET) curve production, and DCR computations.
4.1. System Outputs
Systems will record observations of events in a ViPER-formatted XML file as described in the “TRECVid 2008 Event Detection: ViPER XML Representation of Events” document. Each event observation generated by a system will include the following items:
· Start frame: The frame number indicating the beginning of the observation (the first frame in the video source file is frame #1).
· End frame: The frame number indicating the last frame of the observation.
· Decision score: A numeric score indicating how likely it is that the event observation exists, with more positive values indicating more likely observations.
· Actual decision: A Boolean value indicating whether or not the event observation should be counted for the primary metric computation.
The decision scores and actual decisions
permit performance assessment over a wide range of operating points. The decision scores provide the information
needed to construct the DET curve. In
order to construct a fuller DET curve, a system must over-generate putative
observations far beyond the optimal point for the system’s best DCR value. The actual decisions provide the mechanism
for the system to indicate which putative observations to include in the DCR
calculation, i.e., the putative observations with a true actual decision.
Systems must ensure their decision scores have the following two characteristics. First, the values must form a non-uniform density function so that the relative evidential strength between two putative observations is discernible. Second, the density function must be consistent across events for a single system so that event-averaged measures using decision scores are meaningful.
Since the decision scores are
consistent across events, the system must use a single threshold for
differentiating true and false actual decisions.
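As a minimal sketch of the four items above and the single-threshold rule, the following Python fragment models one putative observation and sets the actual decisions. The class name, field names, scores, and the 0.5 threshold are illustrative assumptions; they are not part of the ViPER format or of any NIST tool.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One putative event observation (hypothetical field names)."""
    event: str
    start_frame: int        # first frame of the observation; frames count from 1
    end_frame: int          # last frame of the observation
    decision_score: float   # more positive means more likely
    actual_decision: bool = False


def apply_threshold(observations, threshold):
    """Set actual decisions using ONE threshold shared by all events."""
    for obs in observations:
        obs.actual_decision = obs.decision_score > threshold
    return observations


observations = apply_threshold(
    [
        Observation("PersonRuns", 120, 180, 0.91),
        Observation("PersonRuns", 400, 455, 0.12),
        Observation("Embrace", 900, 960, 0.67),
    ],
    threshold=0.5,  # assumed value, for illustration only
)

# Only observations with a true actual decision feed the primary metric.
primary = [o for o in observations if o.actual_decision]
```

The low-scoring observation stays in the output with a false actual decision, so it can still contribute to the DET curve without affecting the primary metric.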
4.2. Event Alignment
Event observations can occur at any time and for any
duration. Therefore, in order to compare
the output of a system to the reference annotations, an optimal one-to-one
mapping is needed between the system and reference observations. The mapping is
required because there is no pre-defined segmentation in the video. The alignment will be performed using the Hungarian solution to the bipartite graph matching problem [1], modeling event observations as nodes in the bipartite graph. The system observations are represented as one set of nodes, and the reference observations are represented as a second set of nodes. The kernel formulas below assume the mapping is performed for a single event (Ej) at a time.
![]()
![]()


![]()
Where:
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
The kernel function for observation comparisons has two levels. The first level differentiates potentially mappable observation pairs from non-mappable observation pairs. The second level takes into account the temporal congruence of the system and reference event observations and the observation’s detection score in relation to the system’s range of detection scores. The decision scores are taken into account to facilitate the DET curve generation. By giving more weight to observations with higher confidence scores, realignment can be avoided during DET curve production.
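The one-to-one mapping idea can be sketched in a few lines of Python. This is not the evaluation kernel: the `kernel` function below is a simplified, hypothetical stand-in that treats a pair as mappable only when the two intervals overlap and scores it by fractional overlap, and the search is an exhaustive enumeration (practical only for tiny inputs) rather than the Hungarian method the plan specifies.

```python
from itertools import permutations


def kernel(ref, sys_obs):
    """Hypothetical two-level kernel on (begin, end) intervals.

    Level 1: return None for non-mappable pairs (no temporal overlap).
    Level 2: 1 plus the fraction of the reference covered by the system.
    """
    overlap = min(ref[1], sys_obs[1]) - max(ref[0], sys_obs[0])
    if overlap <= 0:
        return None
    return 1.0 + overlap / (ref[1] - ref[0])


def align(refs, sys_obs_list):
    """Find the one-to-one mapping {ref index: sys index} with maximal total score."""
    # None slots let a reference stay unmapped (a missed detection).
    slots = list(range(len(sys_obs_list))) + [None] * len(refs)
    best_score, best_map = -1.0, {}
    for choice in permutations(slots, len(refs)):
        score, mapping = 0.0, {}
        for r, s in enumerate(choice):
            if s is None:
                continue
            k = kernel(refs[r], sys_obs_list[s])
            if k is None:
                break  # this candidate uses a non-mappable pair
            score += k
            mapping[r] = s
        else:
            if score > best_score:
                best_score, best_map = score, mapping
    return best_map


refs = [(0, 10), (20, 30)]
sys_obs_list = [(1, 9), (8, 12), (22, 31)]
mapping = align(refs, sys_obs_list)
misses = [r for r in range(len(refs)) if r not in mapping]
false_alarms = [s for s in range(len(sys_obs_list)) if s not in mapping.values()]
```

Here both references are mapped, and the second system observation, which overlaps the first reference less than its competitor does, is left unmapped and counts as a false alarm.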
4.3. Detection Error Tradeoff Curves
Graphical performance assessment
uses a Detection Error Tradeoff (DET) curve that plots a series of
event-averaged missed detection probabilities and false alarm rates that are a
function of a detection threshold, Θ. Θ is applied to the system’s detection scores: the system observations with scores above Θ are “declared” to be the set of detected observations. After Θ is applied, the measurements are computed separately for each event and then averaged to generate a DET line trace. The per-event formulas for PMiss and RFA are:
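$$P_{Miss}(S, E_i, \Theta) = \frac{N_{Miss}(S, E_i, \Theta)}{N_{Targ}(E_i)}$$
$$R_{FA}(S, E_i, \Theta) = \frac{N_{FA}(S, E_i, \Theta)}{T_{Source}}$$
Where:
$N_{Miss}(S, E_i, \Theta)$ = the number of missed detections of system S for event Ei at threshold Θ
$N_{FA}(S, E_i, \Theta)$ = the number of false alarms of system S for event Ei at threshold Θ
$N_{Targ}(E_i)$ = the number of reference observations of event Ei
$T_{Source}$ = the duration of the source data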
The formulas to compute averages
over all events are defined as:
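$$P_{Miss}(S, \Theta) = \frac{1}{N_{EventsNZ}} \sum_{i:\,N_{Targ}(E_i) > 0} P_{Miss}(S, E_i, \Theta)$$
$$R_{FA}(S, \Theta) = \frac{1}{N_{Events}} \sum_{i=1}^{N_{Events}} R_{FA}(S, E_i, \Theta)$$
Where:
$N_{Events}$ = the number of events
$N_{EventsNZ}$ = the number of events with at least one true occurrence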
PMiss(S, Θ) is not defined for all events
because NTarg(Ei) may be 0. Therefore PMiss(S, Θ)
is calculated over the set of events with true occurrences. This enables the evaluation of a system on
events that do not exist in the test corpus.
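The averaging rule can be illustrated with a short Python sketch that computes one DET point. It assumes the alignment has already been done, so each system observation carries a flag saying whether it was mapped to a reference observation. The data layout, scores, and durations are invented for the example, and at least one event is assumed to have true occurrences.

```python
def det_point(events, theta, t_source):
    """Return the event-averaged (PMiss, RFA) at threshold theta.

    events maps an event name to (n_targ, [(score, is_mapped), ...]).
    """
    p_miss, r_fa = [], []
    for n_targ, obs in events.values():
        declared = [mapped for score, mapped in obs if score > theta]
        n_correct = sum(declared)
        n_fa = len(declared) - n_correct
        r_fa.append(n_fa / t_source)      # defined for every event
        if n_targ > 0:                    # PMiss is undefined when NTarg = 0
            p_miss.append((n_targ - n_correct) / n_targ)
    return sum(p_miss) / len(p_miss), sum(r_fa) / len(r_fa)


events = {
    "PersonRuns": (4, [(0.9, True), (0.8, True), (0.6, False), (0.3, True)]),
    "Embrace": (0, [(0.7, False)]),       # no true occurrences in the corpus
}
point_high = det_point(events, theta=0.5, t_source=2.0)
point_low = det_point(events, theta=0.2, t_source=2.0)
```

The event with no true occurrences still contributes to the false alarm average but is left out of the missed detection average, which is what keeps the measure defined.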
The evaluation will use the Normalized Detection Cost
Rate (NDCR) measure for evaluating
system performance. NDCR is a weighted linear combination of the system’s Missed
Detection Probability and False Alarm Rate (measured per unit time). The measure’s derivation can be found in
Appendix A and the final formula is summarized below. NIST will report an NDCR for each event and
not average them over events.
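$$NDCR(S, E) = P_{Miss}(S, E) + Beta \cdot R_{FA}(S, E)$$
Where:
$$P_{Miss}(S, E) = \frac{N_{Miss}(S, E)}{N_{Targ}(E)}$$
$$R_{FA}(S, E) = \frac{N_{FA}(S, E)}{T_{Source}}$$
$$Beta = \frac{Cost_{FA}}{Cost_{Miss} \cdot R_{Target}}$$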
The measure’s unit is Cost per Unit Time, normalized so that NDCR = 0 indicates perfect performance and NDCR = 1 is the cost of a system that provides no output, i.e., PMiss = 1 and RFA = 0.
For the 2009 evaluation, CostFA
= 1, CostMiss = 10, and RTarget = 20.0.
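With these constants, Beta = CostFA / (CostMiss · RTarget) = 1 / (10 · 20) = 0.005. The Python sketch below works one example through, assuming the form NDCR = PMiss + Beta · RFA that follows from the normalization described in Appendix A. The counts and the source duration are invented, and the duration is assumed to be expressed in the same time unit as RTarget.

```python
COST_MISS = 10.0
COST_FA = 1.0
R_TARGET = 20.0
BETA = COST_FA / (COST_MISS * R_TARGET)   # 0.005 for the 2009 constants


def ndcr(n_miss, n_fa, n_targ, t_source):
    """Normalized Detection Cost Rate for one system and one event."""
    p_miss = n_miss / n_targ    # missed detection probability
    r_fa = n_fa / t_source      # false alarms per unit time
    return p_miss + BETA * r_fa


example = ndcr(n_miss=5, n_fa=30, n_targ=20, t_source=10.0)   # about 0.265
no_output = ndcr(n_miss=20, n_fa=0, n_targ=20, t_source=10.0)
perfect = ndcr(n_miss=0, n_fa=0, n_targ=20, t_source=10.0)
```

The two boundary cases reproduce the normalization: a system that outputs nothing scores 1, and a perfect system scores 0.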
Two versions of the NDCR will be calculated for each
system: the Actual NDCR and the Minimum NDCR.
4.4.1. Actual NDCR
The Actual NDCR is the primary evaluation
metric. It is computed by restricting
the putative observations to those with true
actual decisions.
4.4.2. Minimum NDCR
The Minimum NDCR is a diagnostic metric. It is found by searching the DET curve for
the Θ with the minimum cost. The difference between the value of Minimum
NDCR and Actual NDCR indicates the benefit a system could have gained by
selecting a better threshold.
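The relationship between the two versions can be sketched in Python by sweeping Θ over the observed scores. The sketch assumes a fixed alignment (each system observation is flagged as mapped or not), the NDCR form PMiss + Beta · RFA with Beta = 0.005, and invented scores and durations; the system threshold of 0.5 is likewise hypothetical.

```python
def ndcr_at(theta, n_targ, obs, t_source, beta=0.005):
    """NDCR when only observations scoring above theta are declared."""
    declared = [mapped for score, mapped in obs if score > theta]
    n_correct = sum(declared)
    n_fa = len(declared) - n_correct
    return (n_targ - n_correct) / n_targ + beta * n_fa / t_source


def minimum_ndcr(n_targ, obs, t_source):
    """Search candidate thresholds; return (lowest cost, its theta)."""
    thetas = [float("-inf")] + sorted({score for score, _ in obs})
    return min((ndcr_at(t, n_targ, obs, t_source), t) for t in thetas)


obs = [(0.9, True), (0.8, True), (0.7, False), (0.4, True),
       (0.3, False), (0.2, False), (0.1, False)]
actual = ndcr_at(0.5, n_targ=4, obs=obs, t_source=0.1)       # system's own threshold
min_cost, best_theta = minimum_ndcr(n_targ=4, obs=obs, t_source=0.1)
gap = actual - min_cost   # what a better threshold would have gained
```

In this example the system's threshold cuts off a correct detection at score 0.4, so lowering Θ to 0.3 trades one extra declared observation for a lower overall cost.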
5. Events
Initially, a video event is
defined to be “an observable action or change of state in a video stream that
would be important for airport security management”. Events may vary greatly in duration, from 2
frames to longer duration events that can exceed the bounds of the excerpt.
Events will be described through
an “event description document”. The
document will include a textual description of the event and a set of exemplar
event occurrences (annotations). Each exemplar will indicate the source file
and temporal coordinates of the event.
Events will be considered to be independent for the evaluation. Therefore, systems may build separately
trained models for each event.
There will be two sets of events:
Required events and Optional events.
There is no implicit difference between the types of events included in
the event sets. Systems should output
detection results for any three events in the required set. For the
optional set, systems can output detection results for some or all of the
events.
The required event set is as
follows:
PersonRuns, CellToEar, ObjectPut, PeopleMeet, PeopleSplitUp,
Embrace, Pointing, ElevatorNoEntry, OpposingFlow, and TakePicture
(defined in the “event description document”).
6. Submission of results
Submissions will be made via ftp according to the
instructions in Appendix B. In addition to the system output, a system
description is also required for each condition. This description must include
a description of the hardware used to process the data, and a detailed
description of the architecture and algorithms used in the system.
7. Schedule
The 2009 evaluation schedule for event detection includes the following milestones:
Jan.--Mar.: Event detection planning & telecons
Feb.: Call for participation in TRECVid
Mar.: Development and Evaluation data specified
Mar.: Release of video data, required event definitions, and examples
Mar.: Final evaluation plan & guidelines written
Mar.: Release scoring tool
Mar.: Development annotations for required events released
Jul.: Dry run (systems run on Dev data)
Sep.: Obtain submissions for formal evaluation
Oct. 1: Distribute preliminary results
Oct. 10: Distribute final results
Oct.: Notebook papers due at NIST
Nov.: Present results at TRECVid
8. References
[1] Kuhn, H. W., “The Hungarian Method for the Assignment Problem”, Naval Research Logistics Quarterly, 2:83-97, 1955.
[2] Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M., “The DET Curve in Assessment of Detection Task Performance”, Eurospeech 1997, pp. 1895-1898.
Appendix A: Derivation of Average
Normalized Detection Cost Rate
Average Normalized Detection Cost Rate (ANDCR) is a weighted linear combination
of the system’s Missed Detection Probability and False Alarm Rate (measured per
unit time). The constant parameters of
ANDCR, which are specified below, represent both the richness of events in the
source data and the relative detriment of particular error types to a
hypothetical application.
The cost of a system begins with the cost of missing
an event (CostMiss)
and the cost of falsely detecting an event (CostFA). NMiss(S,E) is the number of missed detections for system S, event E. NFA(S,E) is the number of false
alarms for the same system and event.
$$DetectionCost(S, E) = Cost_{Miss} \cdot N_{Miss}(S, E) + Cost_{FA} \cdot N_{FA}(S, E)$$
To facilitate comparisons across systems and test sets, we convert Detection Cost to a rate by dividing by the length of the source data. Typically, such measures are converted to percentages by dividing by the count of discrete units for which systems make decisions. In a streaming environment, there are no discrete units; therefore, normalizing by unit time is more appropriate. Note also that the measure of Type I error, RFA, is commonly used in surveillance-style applications.
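$$DCR(S, E) = \frac{DetectionCost(S, E)}{T_{Source}} = Cost_{Miss} \cdot \frac{N_{Miss}(S, E)}{T_{Source}} + Cost_{FA} \cdot \frac{N_{FA}(S, E)}{T_{Source}}$$
$$DCR(S, E) = Cost_{Miss} \cdot P_{Miss}(S, E) \cdot R_{Target}(E) + Cost_{FA} \cdot R_{FA}(S, E)$$
Where:
$$P_{Miss}(S, E) = \frac{N_{Miss}(S, E)}{N_{Targ}(E)}, \quad R_{Target}(E) = \frac{N_{Targ}(E)}{T_{Source}}, \quad R_{FA}(S, E) = \frac{N_{FA}(S, E)}{T_{Source}}$$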
RTarget(E) is the rate of occurrences for the event. This value depends on the event, but providing this prior to a system for each event changes the definition of an event: the definition would then include both the event description and the prior. Instead, we replace the event-dependent prior with a single, global prior, RTarget, that in combination with CostMiss and CostFA reflects the characteristics of the surrogate application. While the events for the evaluation have not been selected yet, we expect them to have similar numbers of occurrences: neither too frequent nor too rare. Therefore, the single prior is warranted. The modified formula becomes:
$$DCR_{Sys}(S, E) = Cost_{Miss} \cdot P_{Miss}(S, E) \cdot R_{Target} + Cost_{FA} \cdot R_{FA}(S, E)$$
The range of the DCRSys measure is [0,∞). To ground the costs, a second normalization scales the cost to be 0 for perfect performance and 1 for a system that provides no output (therefore PMiss = 1 and RFA = 0). The resulting formula is the Normalized Detection Cost Rate of a system (NDCR).
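$$NDCR(S, E) = \frac{DCR_{Sys}(S, E)}{Cost_{Miss} \cdot R_{Target}}$$
$$NDCR(S, E) = P_{Miss}(S, E) + \frac{Cost_{FA}}{Cost_{Miss} \cdot R_{Target}} \cdot R_{FA}(S, E)$$
$$NDCR(S, E) = P_{Miss}(S, E) + Beta \cdot R_{FA}(S, E)$$
Where:
$$Beta = \frac{Cost_{FA}}{Cost_{Miss} \cdot R_{Target}}$$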
Beta is separated
out because it is composed of constant values that define the parameters of the
surrogate application.
To calculate performance over an ensemble of events, we define the Average Normalized Detection Cost Rate (ANDCR(S)) by averaging the Missed Detection probabilities over all events with at least one true event occurrence (NEventsNZ) and averaging the False Alarm Rates over all events. By separating the two averages, the measure can incorporate events with no true occurrences while remaining defined.
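$$ANDCR(S) = P_{Miss}(S) + Beta \cdot R_{FA}(S)$$
Where:
$$P_{Miss}(S) = \frac{1}{N_{EventsNZ}} \sum_{i:\,N_{Targ}(E_i) > 0} P_{Miss}(S, E_i)$$
$$R_{FA}(S) = \frac{1}{N_{Events}} \sum_{i=1}^{N_{Events}} R_{FA}(S, E_i)$$
$N_{Events}$ = the number of events
$N_{EventsNZ}$ = the number of events with at least one true occurrence
$$Beta = \frac{Cost_{FA}}{Cost_{Miss} \cdot R_{Target}}$$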
The measure’s unit is in terms of Cost per
Unit Time.
Appendix B: Submission Instructions
The packaging and file naming conventions for the TRECVid 2009 Event Detection evaluation rely on Experiment Identifiers (EXP-IDs) to organize and identify the files for each evaluation condition and link the system inputs to system outputs. Since EXP-IDs may be used in multiple contexts, some fields contain default values. The following section describes the EXP-IDs to be used for the London Gatwick airport surveillance development dataset (devset) and evaluation dataset (evalset).
The following BNF describes the EXP-ID structure:
EXP-ID ::= <SITE>_<YEAR>_<TASK>_<DATA>_<LANG>_<INPUT>_<SYSID>_<VERSION>
where,
<SITE> ::= expt | short name of participant’s site
The special SITE code “expt” is used in the EXP-ID to indicate a reference annotation.
<YEAR> ::= 2009
<TASK> ::= retroED
<DATA> ::= DEV09 | EVAL09
<LANG> ::= ENG
<INPUT> ::= s-camera
<SYSID> ::= a site-specified string (that does not contain underscores) designating the system used
The SYSID string must be present. It is to begin with p- for a primary system or with c- for any contrastive systems. For example, this string could be p-baseline or c-contrast. This field is intended to differentiate between runs for the same evaluation condition. Therefore, a different SYSID should be created for runs where any changes were made to a system.
<VERSION> ::= 1..n (with values greater than 1 indicating multiple runs of the same experiment/system)
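The BNF above can be checked mechanically. The Python sketch below builds and validates an EXP-ID for a system submission; the function name and the regular expression are illustrative assumptions, not part of the NIST tooling, and the pattern enforces the p-/c- prefix that the plan requires for system runs.

```python
import re

# One named group per field of the BNF; fields with a single legal value are fixed.
EXP_ID_RE = re.compile(
    r"^(?P<site>[^_]+)_(?P<year>2009)_(?P<task>retroED)_"
    r"(?P<data>DEV09|EVAL09)_(?P<lang>ENG)_(?P<input>s-camera)_"
    r"(?P<sysid>[pc]-[^_]+)_(?P<version>[1-9][0-9]*)$"
)


def build_exp_id(site, data, sysid, version=1):
    """Assemble an EXP-ID and reject it if it does not follow the BNF."""
    exp_id = f"{site}_2009_retroED_{data}_ENG_s-camera_{sysid}_{version}"
    if not EXP_ID_RE.match(exp_id):
        raise ValueError(f"invalid EXP-ID: {exp_id}")
    return exp_id


exp_id = build_exp_id("SITE", "EVAL09", "p-baseline")
fields = EXP_ID_RE.match(exp_id).groupdict()
```

A SYSID that contains an underscore or lacks the p-/c- prefix raises `ValueError`, which catches the most likely naming mistakes before packaging.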
In order to facilitate transmission to NIST and subsequent scoring, submissions must be made using the following protocol, consisting of three steps: (1) preparing a system description, (2) packaging system outputs and system descriptions, and (3) transmitting the data to NIST.
B.1 System Descriptions
Documenting each system is vital to interpreting evaluation results. As such, each submitted system (determined by unique experiment identifiers) must be accompanied by a system description with the following information:
Section 1. Experiment Identifier(s)
List all the experiment IDs for which system outputs were submitted. Experiment IDs are described in further detail above.
Section 2. System Description
A brief technical description of your system; if a contrastive test, contrast with the primary system description.
List all events processed on a single line as follows:
Events_Processed: Event1 Event2 Event3 …
(Note: each event should come from the required event set.)
Section 3. Training:
A list of resources used for training and development.
Section 4. References:
A list of all pertinent references.
B.2 Packaging Submissions
All system output submissions must be formatted according to the following directory structure:
output/<EXP-ID>/<EXP-ID>.txt
output/<EXP-ID>/<SOURCE_FILE>*.xml
where,
<EXP-ID> is the experiment identifier,
<EXP-ID>.txt is the system description file as specified above (section B.1),
<SOURCE_FILE>*.xml is the set of ViPER-formatted system output files generated for the indicated experiment. The SOURCE_FILE stem corresponds to the name of the video, without its extension. The events detected for each source file must be combined into a single ViPER file. Thus, for a given EXP-ID, there will be an experimental control file (ECF) and several system output files, one for each video referenced in the ECF. (Note: the evaluation tools will contain a ViPER file merging script, ‘TV08MergeHelper’, useful for combining ViPER files that may each contain only observations for a single event into a larger ViPER file.)
For example, the ECF ‘expt_2009_retroED_DEV09_ENG_s-camera_NIST_1.ecf.xml’ will have the following system outputs (from site SITE, system SYS), in a directory named SITE_2009_retroED_DEV09_ENG_s-camera_NIST_1:
B.3 Transmitting Submissions
To prepare your submission, first create the previously described file/directory structure. This structure may contain the output of multiple experiments, although you are free to submit one experiment at a time if you prefer. The following instructions assume that you are using the UNIX operating system. If you do not have access to UNIX utilities or ftp, please contact NIST to make alternate arrangements.
First, change directory to the parent directory of your “output/” directory. Next, type the following command:
tar -cvf - ./output | gzip > <SITE>_<SUB-NUM>.tgz
where,
<SITE> is the ID for your site
<SUB-NUM> is an integer 1 to n, where 1 identifies your first submission, 2 your second, etc.
This command creates a single tar/gzip file containing all of your results. Next, ftp to jaguar.ncsl.nist.gov, giving the username 'anonymous' and (if requested) your e-mail address as the password. After you are logged in, issue the following set of commands (the prompt will be 'ftp>'):
ftp> cd incoming
ftp> binary
ftp> put <SITE>_<SUB-NUM>.tgz
ftp> quit
Note that because the “incoming” ftp directory (where you just ftp’d your submission) is write protected, you will not be able to overwrite any existing file by the same name (you will get an error message if you try), and you will not be able to list the incoming directory (i.e., with the “ls” or “dir” commands). Please note whether you get any error messages from the ftp process when you execute the ftp commands stated above and report them to NIST.
The last thing you need to do is send an e-mail message to jfiscus@nist.gov, travis.rose@nist.gov, and martial@nist.gov to notify NIST of your submission. The following information should be included in your email:
If possible, please submit your files well before the due date so that we have time to deal with any transmission errors that might occur. Submissions must validate against TrecVid08.xsd, or they will be rejected. Note that submissions received after the stated due dates for any reason will be marked late.
B.4 Scoring Submissions
To score system output against ground truth, the TV08Scorer (part of the F4DE software) is invoked as follows:
TV08Scorer --showAT --allAT --writexml <writexml_dir> --pruneEvents \
  --computeDETcurve --OutputFileRoot <outputfileroot_filebase> \
  --titleOfSys "<titleofsys_string>" --observationCont \
  --ecf <ecf_file> <submissionsysfiles_list> --gtf <referencefiles_list> \
  --fps 25 --deltat 10 --limitto <event1,event2,event3> \
  --MissCost 10 --CostFA 1 --Rtarget 20
where:
- <writexml_dir> is a directory in which alignment ViPER files will be written (containing Mapped, Unmapped_Sys, Unmapped_Ref Events) (example value: alignment_xmls)
- <outputfileroot_filebase> is the file base for all DCR output (example value: Dcr_results)
- <titleofsys_string> is the string that will be added to the plots (example value: <SITE> | <SYS-ID>)
- <ecf_file> is the ECF (XML) file against which the scoring is made (note: all the files listed within the sys or ref files have to be listed in the ECF for scoring to occur) (example value: expt_2008_retroED_DEV08_ENG_s-camera_p-baseline_1.ecf)
- <submissionsysfiles_list> is the list of SYS files provided (example from dryrun: sys/LGW_20071112_E1_CAM1.xml sys/LGW_20071112_E1_CAM2.xml sys/LGW_20071112_E1_CAM3.xml sys/LGW_20071112_E1_CAM4.xml sys/LGW_20071112_E1_CAM5.xml)
- <referencefiles_list> is the list of REF files to score against (example from dryrun: ref/LGW_20071112_E1_CAM1.xgtf ref/LGW_20071112_E1_CAM2.xgtf ref/LGW_20071112_E1_CAM3.xgtf ref/LGW_20071112_E1_CAM4.xgtf ref/LGW_20071112_E1_CAM5.xgtf)
- <event1,event2,event3> is the list of events scored here; it should be the same list as found in the submission's "Events_Processed:" line, but comma separated instead of space separated.
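Converting the space-separated "Events_Processed:" line into the comma-separated value expected by --limitto is a small step that is easy to script. The helper below is a hypothetical convenience function, not part of F4DE; it assumes the system description text contains exactly the line format shown in section B.1.

```python
def limitto_from_description(description_text):
    """Return the comma-separated event list for --limitto.

    Reads the 'Events_Processed:' line of a system description file.
    """
    for line in description_text.splitlines():
        if line.strip().startswith("Events_Processed:"):
            events = line.split(":", 1)[1].split()
            return ",".join(events)
    raise ValueError("no Events_Processed: line found")


description = """Section 2. System Description
A baseline detector (placeholder text).
Events_Processed: PersonRuns Embrace Pointing
"""
limitto = limitto_from_description(description)
```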
Appendix C: Experimental Control File (ECF)
Experiment Control Files (ECFs) are the mechanism the evaluation infrastructure uses to specify the source file excerpts to use for a specific experimental condition. The ECF file specifies several attributes of each excerpt, such as the time regions within a media file, the language, and the source type.
Two ECF files will be used for the evaluation. The first is a system input ECF file that will
be provided for all tasks to indicate what data is to be processed by the
system. The second is a scoring ECF file that the evaluation code uses to
determine the range of data to evaluate the system on. In the event a problem is discovered with the
data, a special scoring ECF file will be used to specify the time regions to be
scored or excluded from scoring.
1. ECF File Naming
ECF file names use the relevant EXP-ID and end with the ‘.ecf.xml’ extension.
2. ECF File Format Description
An ECF file is an XML-formatted file that consists of a hierarchy of nodes, “ecf”, “excerpt_list”, and “excerpt”, described below:
The “ecf” node contains an “excerpt_list” node and the following elements:
· source_signal_duration: a floating point number indicating the total duration in seconds of the video recording specified by the excerpts
· version: a version identifier for the ECF file
· excerpt_list: node contains a set of “excerpt” nodes.
The “excerpt” element is a non-overlapping node that specifies the excerpt from a media file to be used in the evaluation. The “excerpt” node has no attributes. The “excerpt” node has the following elements:
· filename: the source file.
· language: the language of the source file (“english”).
· source_type: the source type of the recording (“surveillance”).
· begin: the beginning time of the segment to process. The time is measured in seconds from the beginning of the recording, which is time = 0.0.
· duration: the duration of the excerpt measured in seconds.
3. An example ECF XML fragment:
<ecf>
  <source_signal_duration>7200.00</source_signal_duration>
  <version>20080601_1400</version>
  <excerpt_list>
    <excerpt>
      <filename>LGW_20071101_E1_CAM1.mpeg</filename>
      <begin>0.0</begin>
      <duration>300.00</duration>
      <language>english</language>
      <source_type>surveillance</source_type>
    </excerpt>
    <excerpt>
      <filename>LGW_20071101_E1_CAM1.mpeg</filename>
      <begin>600.0</begin>
      <duration>300.00</duration>
      <language>english</language>
      <source_type>surveillance</source_type>
    </excerpt>
    …
    <excerpt>
      <filename>LGW_20071112_E1_CAM5.mpeg</filename>
      <begin>0.0</begin>
      <duration>300.00</duration>
      <language>english</language>
      <source_type>surveillance</source_type>
    </excerpt>
    …
  </excerpt_list>
  …
</ecf>
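An ECF of this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below lists each excerpt's time region and sums the durations. The embedded document is a complete two-excerpt ECF made up for the example (the fragment above is elided and would not parse as-is), and `read_excerpts` is a hypothetical helper, not part of the evaluation tools.

```python
import xml.etree.ElementTree as ET

# A small, complete ECF built for illustration; values are made up.
ECF_XML = """<ecf>
  <source_signal_duration>600.00</source_signal_duration>
  <version>20080601_1400</version>
  <excerpt_list>
    <excerpt>
      <filename>LGW_20071101_E1_CAM1.mpeg</filename>
      <begin>0.0</begin>
      <duration>300.00</duration>
      <language>english</language>
      <source_type>surveillance</source_type>
    </excerpt>
    <excerpt>
      <filename>LGW_20071101_E1_CAM1.mpeg </filename>
      <begin>600.0</begin>
      <duration>300.00</duration>
      <language>english</language>
      <source_type>surveillance</source_type>
    </excerpt>
  </excerpt_list>
</ecf>"""


def read_excerpts(ecf_text):
    """Return (filename, begin_seconds, end_seconds) for every excerpt."""
    root = ET.fromstring(ecf_text)
    excerpts = []
    for node in root.find("excerpt_list").findall("excerpt"):
        begin = float(node.findtext("begin"))
        duration = float(node.findtext("duration"))
        # strip() tolerates stray whitespace around the file name
        excerpts.append((node.findtext("filename").strip(), begin, begin + duration))
    return excerpts


excerpts = read_excerpts(ECF_XML)
total_seconds = sum(end - begin for _, begin, end in excerpts)
```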