2008 TRECVid Event Detection Evaluation Plan
This document presents the
evaluation plan for event detection in surveillance video for TRECVid 2008. The
goal of the evaluation will be to build and evaluate systems that can detect
instances of a variety of observable events in the airport surveillance domain.
The video source data to be used is a ~100-hour corpus of video surveillance
data collected by the UK Home Office at the London Gatwick International
Airport.
Two event detection tasks will be
supported: a retrospective event detection task run with complete reference
annotations, and a “freestyle” experimental analysis track to permit
participants to explore their own ideas with regard to the airport surveillance
domain.
Because this is an initial effort,
the evaluation will be run as more of an experimental test-bed. To that
end, we propose two changes to the typical evaluation paradigm. First, the entire source video corpus will be
released early so that research can begin immediately. Participants will be on the honor system to
keep the evaluation set blind. Second, two sets of events will be
defined: a required set defined by NIST and the LDC, whose descriptions and
annotations will be released quickly for research to begin, and an optional, secondary
set of events nominated by participants. The development resources (event
definitions and annotations) for nominated events will be released later in the
year. We hope these steps will
accelerate research and knowledge sharing and permit
faster evolution of the evaluation paradigm.
The following topics are discussed
below:
· Video source data
· Evaluation tasks
· Evaluation measures
· Evaluation infrastructure
· Event definitions
· Schedule
The source data will consist of
100 hours (10 days * 2 hours/day * 5 cameras), obtained from Gatwick Airport
surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium will provide
event annotations for the entire corpus according to the milestones listed in
the schedule.
The 100-hour corpus will be
divided into development and evaluation subsets. In particular, the first 5
days of the corpus will be used as the development subset (devset), and the
second 5 days of the corpus will be used as the evaluation subset
(evalset).
Developers may use the devset in
any manner to build their systems, including activities such as dividing it
into internal test sets, jackknifed training, etc. During the summer months, NIST will conduct a
dry run evaluation using the devset as the video source. While testing on the development data is a
non-blind system test, the purpose of the dry run (to test the evaluation
infrastructure) is most easily accomplished using the devset.
We will release the full corpus (devset
+ evalset) early in the evaluation cycle to give people the opportunity to
preprocess the full corpus throughout the year. The evaluation set must not be
inspected or mined for information until after the evalset annotations are
released. The evalset restriction applies to both evaluation tasks. However,
participants can run feature extraction programs on the evalset to prepare for
the formal evaluation.
Allowable side information (i.e.,
“contextual” information) will include resources posted on the TRECVid Event
Detection website as well as any annotations constructed by developers based on
the devset. Participants may share devset annotations. No annotation of the
evalset is permitted prior to the evaluation submission deadline.
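This proposal includes the
following evaluation tasks: a retrospective event detection task, run with
complete reference annotations, and a “freestyle” experimental analysis track.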
Systems will be evaluated on how well they can detect
event occurrences in the evaluation corpus.
The determination of correct detection will be based solely on the
temporal similarity between the annotated reference event observations and the
system-detected event observations.
System detection performance is measured as a
tradeoff between two error types: missed detections (MD) and false alarms
(FA). The two error types will be
combined into a single error measure using the Detection Cost Rate (DCR) model,
which is a linear combination of the two errors. The DCR model distills the needs of a
hypothetical application into a set of predefined constant parameters that
include the event priors and weights for each error type. While the chosen constants have been
motivated by discussions with the research and user communities, the single
operation point characterized by the DCR model is a small window into the
performance of an event detection system.
In addition to DCR measures, Detection Error Tradeoff (DET) curves will
be produced to graphically depict the tradeoff of the two error types over a
wide range of operational points. The DCR model and the DET curve are related:
the DCR model defines an optimal point along the DET curve.
The rest of this section defines the system output,
followed by the three steps of the evaluation process: temporal alignment,
Detection Error Tradeoff (DET) curve production, and DCR computations.
4.1. System Outputs
Systems will record observations of events in a
ViPER-formatted XML file as described in the “TRECVid 2008 Event Detection:
ViPER XML Representation of Events” document.
Each event observation generated by a system will include the following
items:
· Start frame: The frame number indicating the beginning of the observation (the first frame in the video source file is frame #1).
· End frame: The frame number indicating the last frame of the observation.
· Decision score: A numeric score indicating how likely it is that the event observation exists, with more positive values indicating more likely observations.
· Actual decision: A Boolean value indicating whether or not the event observation should be counted for the primary metric computation.
The decision scores and actual decisions
permit performance assessment over a wide range of operating points. The decision scores provide the information
needed to construct the DET curve. In
order to construct a fuller DET curve, a system must over-generate putative
observations far beyond the optimal point for the system’s best DCR value. The actual decisions provide the mechanism
for the system to indicate which putative observations to include in the DCR
calculation, i.e., the putative observations with a true actual decision.
Systems must ensure their decision
scores have the following two characteristics.
First, the values must form a non-uniform density function so that the
relative evidential strength between two putative observations is discernible. Second, the density function must be
consistent across events for a single system so that event-averaged measures using
decision scores are meaningful.
Since the decision scores are
consistent across events, the system must use a single threshold for
differentiating true and false actual decisions.
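The four items of an event observation, and the single-threshold rule for actual decisions, can be sketched as follows. This is a minimal illustration only: the `Observation` class, the `apply_threshold` helper, the scores, and the threshold value are all hypothetical and are not part of the ViPER XML format or of any NIST tool.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One system-generated event observation (hypothetical in-memory form)."""
    event: str                 # event name, e.g. "PersonRuns"
    start_frame: int           # first frame of the observation; frames count from 1
    end_frame: int             # last frame of the observation
    decision_score: float      # more positive means more likely
    actual_decision: bool = False


def apply_threshold(observations, threshold):
    """Set actual decisions with ONE threshold shared by all events."""
    for obs in observations:
        obs.actual_decision = obs.decision_score > threshold
    return observations


# Over-generated putative observations for two events (made-up values).
observations = [
    Observation("PersonRuns", 120, 310, 0.91),
    Observation("PersonRuns", 900, 1010, 0.12),
    Observation("CellToEar", 45, 400, 0.57),
]
apply_threshold(observations, threshold=0.5)

# Only these would enter the primary (Actual NDCR) computation.
hard_decisions = [o for o in observations if o.actual_decision]
```

Keeping the low-scoring observation in the output while marking its actual decision false is what allows a fuller DET curve to be drawn without affecting the primary metric.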
4.2. Event Alignment
Event observations can occur at any time and for any
duration. Therefore, in order to compare
the output of a system to the reference annotations, an optimal one-to-one
mapping is needed between the system and reference observations. The mapping is
required because there is no pre-defined segmentation in the video. The
alignment will be performed using the Hungarian solution to the bipartite graph
matching problem [1], modeling event observations as nodes in the bipartite graph. The system observations are represented as
one set of nodes, and the reference observations are represented as a second
set of nodes. The kernel formulas below assume the mapping is performed for a
single event (Ej) at a time.
[Kernel formulas and their symbol definitions: equation images missing from this copy.]
The kernel function for observation comparisons has two levels. The first level
differentiates potentially mappable
observation pairs from non-mappable observation pairs. The second level takes
into account the temporal congruence of the system and reference event
observations and the observation’s detection score in relation to the system’s
range of detection scores. The decision scores are taken into account to
facilitate the DET curve generation. By giving more weight to observations with
higher confidence scores, realignment can be avoided during DET curve production.
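The one-to-one mapping step can be sketched as below. The kernel used here is an assumption made purely for illustration (frame-level intersection over union, with disjoint pairs treated as non-mappable); it is not the kernel defined in this plan, and it ignores decision scores. The function names and the example frame spans are hypothetical. The assignment itself uses a compact implementation of the Hungarian method [1] on a square matrix padded with dummy nodes, so that any observation may remain unmapped.

```python
BIG = 1e9  # cost large enough that a non-mappable pair is never preferred


def overlap_kernel(sys_obs, ref_obs):
    """Hypothetical kernel: frame-level intersection over union, None if disjoint."""
    (s0, s1), (r0, r1) = sys_obs, ref_obs
    inter = min(s1, r1) - max(s0, r0) + 1  # frame spans are inclusive
    if inter <= 0:
        return None
    return inter / (max(s1, r1) - min(s0, r0) + 1)


def hungarian(cost):
    """Minimum-cost perfect matching on a square cost matrix (Hungarian method)."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (n + 1)      # column potentials
    match = [0] * (n + 1)    # match[j] = 1-based row assigned to column j
    way = [0] * (n + 1)      # predecessor column on the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0:  # flip assignments along the augmenting path
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    return [(match[j] - 1, j - 1) for j in range(1, n + 1)]


def align(sys_obs, ref_obs):
    """Map system observations to reference observations, at most one-to-one."""
    n, m = len(sys_obs), len(ref_obs)
    size = n + m  # dummy rows/columns let any observation stay unmapped at cost 0
    if size == 0:
        return {}
    cost = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            k = overlap_kernel(sys_obs[i], ref_obs[j])
            cost[i][j] = BIG if k is None else -k  # maximize kernel = minimize -kernel
    return {i: j for i, j in hungarian(cost)
            if i < n and j < m and cost[i][j] < 0}


# (start_frame, end_frame) spans for a single event; made-up values.
reference = [(10, 50), (100, 150)]
system = [(12, 48), (300, 320), (105, 160)]
mapping = align(system, reference)  # system index -> reference index
false_alarms = [i for i in range(len(system)) if i not in mapping]
misses = [j for j in range(len(reference)) if j not in mapping.values()]
```

Unmapped system observations count as false alarms and unmapped reference observations count as missed detections, which is the input the later measures need.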
4.3. Detection Error Tradeoff Curves
Graphical
performance assessment uses a Detection Error Tradeoff (DET) curve that plots a
series of event-averaged missed detection probabilities and false alarm rates
as a function of a detection threshold, Θ. Θ is applied to the system’s detection
scores: the system observations with scores above Θ are “declared” to be the set of
detected observations. After Θ is applied, the measurements are computed
separately for each event and then averaged to generate a DET line trace. The per-event formulas for PMiss and RFA are:
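$$P_{\mathrm{Miss}}(S, E_i, \Theta) = \frac{N_{\mathrm{Miss}}(S, E_i, \Theta)}{N_{\mathrm{Targ}}(E_i)}$$
$$R_{\mathrm{FA}}(S, E_i, \Theta) = \frac{N_{\mathrm{FA}}(S, E_i, \Theta)}{T_{\mathrm{Source}}}$$
Where
$N_{\mathrm{Miss}}(S, E_i, \Theta)$ = the number of missed detections of system S for event $E_i$ at threshold Θ
$N_{\mathrm{FA}}(S, E_i, \Theta)$ = the number of false alarms of system S for event $E_i$ at threshold Θ
$N_{\mathrm{Targ}}(E_i)$ = the number of reference occurrences of event $E_i$
$T_{\mathrm{Source}}$ = the duration of the source video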
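The formulas to compute averages over all events are defined as:
$$P_{\mathrm{Miss}}(S, \Theta) = \frac{1}{N_{\mathrm{EventsNZ}}} \sum_{i:\, N_{\mathrm{Targ}}(E_i) > 0} P_{\mathrm{Miss}}(S, E_i, \Theta)$$
$$R_{\mathrm{FA}}(S, \Theta) = \frac{1}{N_{\mathrm{Events}}} \sum_{i=1}^{N_{\mathrm{Events}}} R_{\mathrm{FA}}(S, E_i, \Theta)$$
$N_{\mathrm{Events}}$ = the number of events
$N_{\mathrm{EventsNZ}}$ = the number of events with at least one true occurrence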
The per-event PMiss(S, Ei, Θ) is undefined for any
event with NTarg(Ei) = 0.
Therefore the average PMiss(S, Θ)
is calculated over the set of events with true occurrences. This enables the evaluation of a system on
events that do not occur in the test corpus.
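The per-event rates and their event averages at one threshold can be sketched as below. The function names, the event counts, and the 50-hour source duration are illustrative assumptions, not results or tool interfaces from this evaluation.

```python
def p_miss(n_miss, n_targ):
    """Missed detection probability for one event; requires n_targ > 0."""
    return n_miss / n_targ


def r_fa(n_fa, t_source):
    """False alarm rate for one event, per unit of source time."""
    return n_fa / t_source


def event_averages(counts, t_source):
    """Average the rates over events at a single threshold.

    counts maps event name -> (n_miss, n_fa, n_targ).
    PMiss is averaged only over events with true occurrences;
    RFA is averaged over all events.
    """
    with_targets = [c for c in counts.values() if c[2] > 0]
    if with_targets:
        avg_p_miss = sum(p_miss(c[0], c[2]) for c in with_targets) / len(with_targets)
    else:
        avg_p_miss = None  # undefined when no event has a true occurrence
    avg_r_fa = sum(r_fa(c[1], t_source) for c in counts.values()) / len(counts)
    return avg_p_miss, avg_r_fa


# Made-up counts at one threshold; TakePicture has no true occurrences.
counts = {
    "PersonRuns": (2, 5, 10),
    "CellToEar": (3, 10, 4),
    "TakePicture": (0, 5, 0),
}
avg_p_miss, avg_r_fa = event_averages(counts, t_source=50.0)
```

Repeating this computation for a sweep of thresholds gives the series of (RFA, PMiss) points that make up one DET line trace.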
4.4. DCR Computations
The evaluation
will use the Normalized Detection Cost Rate (NDCR) measure for evaluating system performance. NDCR
is a weighted linear combination of the system’s Missed Detection Probability
and False Alarm Rate (measured per unit time).
The measure’s derivation can be found in Appendix A and the final
formula is summarized below. NIST will
report an NDCR for each event and not average them over events.
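$$\mathrm{NDCR}(S, E) = P_{\mathrm{Miss}}(S, E) + \mathrm{Beta} \cdot R_{\mathrm{FA}}(S, E)$$
Where:
$P_{\mathrm{Miss}}(S, E)$ = the missed detection probability of system S for event E
$R_{\mathrm{FA}}(S, E)$ = the false alarm rate of system S for event E, per unit time
$\mathrm{Beta} = \frac{\mathrm{Cost}_{\mathrm{FA}}}{\mathrm{Cost}_{\mathrm{Miss}} \cdot R_{\mathrm{Target}}}$
$\mathrm{Cost}_{\mathrm{Miss}}$, $\mathrm{Cost}_{\mathrm{FA}}$ = the constant costs of a missed detection and of a false alarm
$R_{\mathrm{Target}}$ = the constant prior rate of event occurrences per unit time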
The measure’s
unit is Cost per Unit Time, normalized so that NDCR = 0 indicates perfect performance and
NDCR = 1 is the cost of a system
that provides no output, i.e., PMiss = 1
and RFA = 0.
Two versions
of the NDCR will be calculated for each system: the Actual NDCR and the Minimum
NDCR.
4.4.1. Actual NDCR
The Actual NDCR
is the primary evaluation metric. It is
computed by restricting the putative observations to those with true actual decisions.
4.4.2. Minimum NDCR
The Minimum NDCR
is a diagnostic metric. It is found by
searching the DET curve for the Θ with the minimum cost. The difference between the value of Minimum
NDCR and Actual NDCR indicates the benefit a system could have gained by
selecting a better threshold.
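The relationship between the two NDCR versions can be sketched as below for a single event, assuming the alignment has already been done so that each system observation is flagged as mapped to a reference observation or not. The cost constants, the observation records, and all function names are placeholder assumptions chosen for illustration; they are not taken from this plan.

```python
from math import inf

# Placeholder application constants (assumed values, for illustration only).
COST_MISS = 10.0
COST_FA = 1.0
R_TARGET = 20.0  # assumed prior: event occurrences per hour
BETA = COST_FA / (COST_MISS * R_TARGET)


def ndcr(p_miss, r_fa, beta=BETA):
    """Normalized Detection Cost Rate: PMiss + Beta * RFA."""
    return p_miss + beta * r_fa


def rates(obs, n_targ, t_source, keep):
    """PMiss and RFA over the observations selected by keep(o)."""
    kept = [o for o in obs if keep(o)]
    n_correct = sum(1 for o in kept if o["mapped"])
    n_fa = len(kept) - n_correct
    return (n_targ - n_correct) / n_targ, n_fa / t_source


def actual_ndcr(obs, n_targ, t_source):
    """Cost at the system's own operating point (true actual decisions)."""
    return ndcr(*rates(obs, n_targ, t_source, lambda o: o["actual"]))


def minimum_ndcr(obs, n_targ, t_source):
    """Lowest cost over all thresholds; scores strictly above theta are kept."""
    thresholds = [-inf] + sorted({o["score"] for o in obs})
    return min(
        ndcr(*rates(obs, n_targ, t_source, lambda o, t=theta: o["score"] > t))
        for theta in thresholds
    )


# Made-up aligned output for one event: 4 reference occurrences, 2 hours of video.
obs = [
    {"score": 0.90, "mapped": True, "actual": True},
    {"score": 0.80, "mapped": True, "actual": True},
    {"score": 0.70, "mapped": False, "actual": True},
    {"score": 0.60, "mapped": True, "actual": True},
    {"score": 0.30, "mapped": False, "actual": True},
    {"score": 0.25, "mapped": False, "actual": True},
    {"score": 0.10, "mapped": False, "actual": False},
]
actual = actual_ndcr(obs, n_targ=4, t_source=2.0)
minimum = minimum_ndcr(obs, n_targ=4, t_source=2.0)
gap = actual - minimum  # what a better threshold would have gained
```

Because the highest threshold in the sweep keeps no observations, the Minimum NDCR found this way never exceeds 1, the cost of a system that provides no output.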
5. Events
Initially,
a video event is defined to be “an observable action or change of state in a
video stream that would be important for airport security management”. Events may vary greatly in duration, from 2
frames to longer duration events that can exceed the bounds of the excerpt.
Events
will be described through an “event description document”. The document will include a textual
description of the event and a set of exemplar event occurrences (annotations).
Each exemplar will indicate the source file and temporal coordinates of the
event. Events will be considered to be
independent for the evaluation.
Therefore, systems may build separately trained models for each event.
There
will be two sets of events: Required events and Optional events. There is no implicit difference between the
types of events included in the event sets. Systems should output detection results for any three events in the
required set. For the optional set, systems can output detection results
for some or all of the events.
The
required event set is as follows:
PersonRuns,
CellToEar, ObjectPut, PeopleMeet, PeopleSplitUp, Embrace, Pointing,
ElevatorNoEntry, OpposingFlow, and TakePicture (defined in the “event
description document”).
Submissions
will be made via ftp according to the instructions in Appendix B. In addition
to the system output, a system description is also required for each condition.
This description must include a description of the hardware used to process the
data, and a detailed description of the architecture and algorithms used in the
system.
The proposed schedule for event definitions and data release is as follows:

| | Required Event Set | Optional Event Set |
|---|---|---|
| Event Selection | By LDC and NIST | Nominated by community input |
| Event Description Release | March 1 | May 15 |
| Development Annot. Release | June 1 | July 1 |
| Test Set Annot. Released | Oct. 1 | Oct. 1 |
| Participation | Required | Optional |
The 2008 evaluation schedule for event detection includes the following milestones:
Jan.--Mar.: Event detection planning & telecons
Feb.: Call for participation in TRECVid
Mar. 10: Release of video data, required event definitions, and examples
Mar. 30: Final evaluation plan & guidelines written
Apr. 4: Call for participation in event detection
Apr. 11: Deadline to commit
May 1: Nominations for candidate events end
May 15: Release of optional event definitions
June 1: Release scoring tool
June 1: Development annotations for required events released
June 1: Dry Run test set specified
July 1: Development annotations for nominated events released
July: Dry run (systems run on Dev data)
Sept. 26: Obtain submissions for formal evaluation
Oct. 1: Release of all annotations
Oct. 1: Distribute preliminary results
Oct. 10: Distribute final results
Oct. 27: Notebook papers due at NIST
Nov. 17-18: Present results at TRECVid
8. References
[1] Harold W. Kuhn, "The Hungarian Method for the assignment problem", Naval Research Logistics Quarterly, 2:83-97, 1955.
[2] Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M., “The DET Curve in Assessment of Detection Task Performance”, Eurospeech 1997, pp 1895-1898.
Appendix A: Derivation of Average Normalized Detection Cost Rate
Average
Normalized Detection Cost Rate (ANDCR)
is a weighted linear combination of the system’s Missed Detection Probability
and False Alarm Rate (measured per unit time).
The constant parameters of ANDCR, which are specified below, represent
both the richness of events in the source data and the relative detriment of
particular error types to a hypothetical application.
The cost of a
system begins with the cost of missing an event (CostMiss) and the cost of falsely detecting an event (CostFA). NMiss(S,E)
is the number of missed detections for system S, event E. NFA(S,E)
is the number of false alarms for the same system and event.
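$$\mathrm{DetectionCost}(S, E) = \mathrm{Cost}_{\mathrm{Miss}} \cdot N_{\mathrm{Miss}}(S, E) + \mathrm{Cost}_{\mathrm{FA}} \cdot N_{\mathrm{FA}}(S, E)$$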
To facilitate
comparisons across systems and test sets, we convert Detection Cost to a rate
by dividing by the length of the source data.
Typically, such a conversion yields percentages by dividing by the
count of discrete units for which systems make decisions. In a streaming environment there are no
discrete units, so normalizing by unit time is more
appropriate. Note also that the
measure of Type I error, RFA,
is commonly used in surveillance-style applications.
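$$\begin{aligned}
\mathrm{DCR}(S, E) &= \frac{\mathrm{DetectionCost}(S, E)}{T_{\mathrm{Source}}} \\
&= \mathrm{Cost}_{\mathrm{Miss}} \cdot \frac{N_{\mathrm{Miss}}(S, E)}{T_{\mathrm{Source}}} + \mathrm{Cost}_{\mathrm{FA}} \cdot \frac{N_{\mathrm{FA}}(S, E)}{T_{\mathrm{Source}}} \\
&= \mathrm{Cost}_{\mathrm{Miss}} \cdot P_{\mathrm{Miss}}(S, E) \cdot R_{\mathrm{Target}}(E) + \mathrm{Cost}_{\mathrm{FA}} \cdot R_{\mathrm{FA}}(S, E)
\end{aligned}$$
$$P_{\mathrm{Miss}}(S, E) = \frac{N_{\mathrm{Miss}}(S, E)}{N_{\mathrm{Targ}}(E)}$$
$$R_{\mathrm{FA}}(S, E) = \frac{N_{\mathrm{FA}}(S, E)}{T_{\mathrm{Source}}}$$
$$R_{\mathrm{Target}}(E) = \frac{N_{\mathrm{Targ}}(E)}{T_{\mathrm{Source}}}$$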
RTarget(E) is the rate of occurrences for the
event. This value depends on the
event, but providing a per-event prior to a system changes the
definition of an event: the definition would then include both the event description and the prior. Instead, we replace the event-dependent prior
with a single, global prior, RTarget,
that in combination with CostMiss
and CostFA reflects the characteristics
of the surrogate application. While the events for the evaluation have not been
selected yet, we expect them to have similar numbers of occurrences:
neither too frequent nor too rare.
Therefore, the single prior is warranted. The modified formula
becomes:
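$$\mathrm{DCR}_{\mathrm{Sys}}(S, E) = \mathrm{Cost}_{\mathrm{Miss}} \cdot P_{\mathrm{Miss}}(S, E) \cdot R_{\mathrm{Target}} + \mathrm{Cost}_{\mathrm{FA}} \cdot R_{\mathrm{FA}}(S, E)$$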
The range of
the DCRSys measure is
[0,∞). To ground the costs, a
second normalization scales the cost to be 0 for perfect performance and 1
for a system that provides no output (therefore PMiss = 1 and RFA
= 0). The resulting formula is the
Normalized Detection Cost Rate of a system (NDCR).
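$$\begin{aligned}
\mathrm{NDCR}(S, E) &= \frac{\mathrm{DCR}_{\mathrm{Sys}}(S, E)}{\mathrm{Cost}_{\mathrm{Miss}} \cdot R_{\mathrm{Target}}} \\
&= P_{\mathrm{Miss}}(S, E) + \frac{\mathrm{Cost}_{\mathrm{FA}}}{\mathrm{Cost}_{\mathrm{Miss}} \cdot R_{\mathrm{Target}}} \cdot R_{\mathrm{FA}}(S, E) \\
&= P_{\mathrm{Miss}}(S, E) + \mathrm{Beta} \cdot R_{\mathrm{FA}}(S, E)
\end{aligned}$$
Where:
$$\mathrm{Beta} = \frac{\mathrm{Cost}_{\mathrm{FA}}}{\mathrm{Cost}_{\mathrm{Miss}} \cdot R_{\mathrm{Target}}}$$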
Beta is
separated out because it is composed of constant values that define the
parameters of the surrogate application.
To calculate
performance over an ensemble of events, we define Average Normalized Detection
Cost Rate (ANDCR(S)) by averaging the
Missed Detection probabilities over all events with at least one true event occurrence
(NEventsNZ) and averaging
the False Alarm Rates over all events. By
separating the two averages, the measure can incorporate events with no true
occurrences while remaining defined.
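$$\mathrm{ANDCR}(S) = P_{\mathrm{Miss}}(S) + \mathrm{Beta} \cdot R_{\mathrm{FA}}(S)$$
Where:
$P_{\mathrm{Miss}}(S) = \frac{1}{N_{\mathrm{EventsNZ}}} \sum_{i:\, N_{\mathrm{Targ}}(E_i) > 0} P_{\mathrm{Miss}}(S, E_i)$
$R_{\mathrm{FA}}(S) = \frac{1}{N_{\mathrm{Events}}} \sum_{i=1}^{N_{\mathrm{Events}}} R_{\mathrm{FA}}(S, E_i)$
$N_{\mathrm{Events}}$ = the number of events
$N_{\mathrm{EventsNZ}}$ = the number of events with at least one true occurrence
$\mathrm{Beta} = \frac{\mathrm{Cost}_{\mathrm{FA}}}{\mathrm{Cost}_{\mathrm{Miss}} \cdot R_{\mathrm{Target}}}$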
The measure’s unit is in terms of Cost per Unit Time.
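As a small illustration of the two separate averages, the sketch below combines already-computed per-event rates into an ANDCR. The `andcr` function name, the Beta value, and the per-event rates are hypothetical values chosen for the example.

```python
def andcr(per_event, beta):
    """Average Normalized Detection Cost Rate over an ensemble of events.

    per_event maps event name -> (p_miss, r_fa); p_miss is None for an
    event with no true occurrences, where PMiss is undefined.
    """
    defined = [pm for pm, _ in per_event.values() if pm is not None]
    avg_p_miss = sum(defined) / len(defined)          # over the NEventsNZ events
    avg_r_fa = sum(rfa for _, rfa in per_event.values()) / len(per_event)
    return avg_p_miss + beta * avg_r_fa


# Made-up per-event rates; TakePicture has no true occurrences.
per_event = {
    "PersonRuns": (0.20, 0.1),
    "CellToEar": (0.75, 0.2),
    "TakePicture": (None, 0.1),
}
score = andcr(per_event, beta=0.005)  # beta is an assumed placeholder value
```

The event with no true occurrences still contributes its false alarms to the score, which is the property the separated averages are meant to preserve.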
Appendix B: Submission Instructions
This appendix will be filled out at a later date.