CALL FOR PARTICIPATION
Topic Detection and Tracking TDT 2001
You are invited to participate in TDT 2001. This year is the fourth in a series of workshops investigating methods for organizing a stream of broadcast news into stories based on the real-world events they describe.
TDT 2001 will continue investigation into the five core tasks described below. The workshop format includes a training/development corpus available immediately, and a dry run evaluation on that data (optional for returning participants) to be held over the summer. The evaluation data will be distributed in September, with system results due in October, and evaluation results returned by late October.
Interested participants should contact the workshop leaders by early April, 2001, to help with planning the details of the workshop.
A workshop for participants will be held in conjunction with the TREC workshop in Gaithersburg, Maryland, on November 12-13.
The TDT workshop investigates organization of broadcast news by the events described in the news. Processing must be done as the news arrives, rather than over a static collection. TDT investigates the following tasks:
Research on all tasks will continue in TDT 2001. However, strong emphasis will be placed on the tracking task this year, and in particular on how to normalize scores across topics. Participants are also strongly encouraged to focus on the Story Link Detection task because of its general applicability.
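For sites new to TDT, Story Link Detection asks whether two given stories discuss the same topic. The following is only a minimal sketch of one possible baseline, assuming a bag-of-words representation and cosine similarity; it is not a method prescribed by the evaluation. The function names, the crude tokenizer, and the threshold value of 0.2 are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count its word tokens (a deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for terms it does not contain, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty story is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def stories_linked(story_a: str, story_b: str, threshold: float = 0.2) -> bool:
    """Hypothetical decision rule: link two stories if their similarity meets the threshold."""
    return cosine_similarity(term_counts(story_a), term_counts(story_b)) >= threshold
```

A single fixed threshold such as this one is the simplest choice; in practice a site would tune it on the development data, and a real system would likely weight terms rather than use raw counts.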
Variations on the tasks may be adopted this year, provided there is consensus from the participating sites. Those may include more realistic handling of "brief" stories (those that include only a passing mention of a topic), detection of non-news stories, and so on.
The evaluation corpus for TDT 2001 will be based upon three months of news stories from the end of 1998 (the TDT-3 corpus). The approximately 60,000 news stories are each in either English or Mandarin and come from a variety of sources spanning newswire, television, radio, and the Web. Because this corpus was used for TDT 2000, it will be "perturbed" this year to make it different.
Evaluation topics will be selected from the TDT 1999 and TDT 2000 evaluation tasks.
Training and development data will use the TDT-2 corpus, approximately 72,000 news stories from the first six months of 1998. Sites will also be permitted to use the "unperturbed" TDT-3 corpus for development, though they are cautioned against overfitting. Training and development topics will include approximately 100 topics available for the TDT-2 English corpus, of which 20 are also available for the TDT-2 Mandarin corpus. A set of 30 topics judged for the TDT-3 corpus will also be available.
A TDT-4 corpus is being created, but will not be ready this year. This corpus will contain all of the TDT-3 sources plus four additional Chinese broadcast sources. The four-month collection will include news text and over 600 hours of English and Mandarin broadcast news, yielding approximately 48,000 English and Mandarin news stories. Annotations will include story boundaries and 60 new topics defined and annotated following the processes that were used for the TDT 2000 evaluation. Sites particularly interested in the new evaluation corpus should participate in TDT 2001 to be involved in the specifications for TDT-4.
TDT data is provided by the Linguistic Data Consortium (LDC). Sites that are current members of the LDC will have access to the data via their membership. Sites that are unable to join the LDC and cannot afford the appropriate license will be given access to the data via an Evaluation Membership as long as they are participating in TDT. (We are working on less-restrictive conditions, but the high cost of the intellectual property makes it difficult to achieve.)
More information about TDT, including details of TDT 2001, many TDT publications, and information from past TDT workshops, is available at the TDT Web site
Organizations wishing to participate in TDT 2001 should respond to this call by:
Please indicate your interest as soon as possible. Dates of all meetings and evaluations will be specified by April 1, 2001. Sites may join TDT 2001 after that time, but will have to abide by those deadlines.
This workshop will be conducted by the National Institute of Standards and Technology (NIST), with support from the Defense Advanced Research Projects Agency's (DARPA) Translingual Information Detection, Extraction and Summarization (TIDES) Program.