Cross-Language Indian News Story Search 2012

Synopsis

Task

This edition of CLiNSS focuses on journalistic text reuse. News agencies are a prolific source of text on the web, and they publish in multiple languages. Because news stories are generated independently, stories covering the same events but written in different languages need to be linked. Linking these stories greatly enhances the user's experience: in a multilingual environment such as India, a reader may want to refer to the local-language version of a news story.

News stories covering the same event published in different languages can be rich sources of parallel and comparable text. Some fragments in these stories are parallel, for example personal quotes and translated versions of the same content. Identifying such highly similar news stories therefore serves a dual purpose: it enhances the reader's experience and it generates a valuable multilingual resource.

This year we continue to offer a task based on the identification of highly similar journalistic articles and news stories in a cross-language setting. The task involves identifying and linking highly similar news stories covering the same event published in different languages.

The focus of CLiNSS this year is to evaluate the identification of news stories with the same focal event in a cross-language environment. Given a source collection S containing news stories in an Indian language Li and a target collection T containing news stories in English (Lt), the task is to link each news story in T to its corresponding version in S for each Li. Two news stories are considered the same if they describe the same focal event. For example, "Housing minister Somanna attacked with slipper" and "BJP leaders condemn attack on Somanna" are two news stories with two different focal events: the former describes the main news event, whilst the latter describes the consequences of that event. The framework of this year's task is shown in Fig. 2. The languages included in S will be Hindi and Gujarati (and possibly Marathi).

The task is similar to a (cross-language) copy detection task in which the query is an entire document and "similar" documents must be found in a set of known documents. The task is not trivial because similar stories may overlap to varying degrees (e.g. a story written in English and used as the query text may be a subset of a longer story written in a different language, and vice versa). Table 1 provides an example of a relevant and a non-relevant English-Hindi text pair. Although both source articles share the same news event as the target, the focal event matches for source article 1 (relevant) but differs for source article 2 (non-relevant).
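
To make the setting concrete, the sketch below shows one naive baseline, which the task does not prescribe: map the English target words into Hindi through a bilingual lexicon and rank source stories by cosine similarity of term counts. LEXICON, to_source_terms and rank_sources are illustrative names; a real system would need a proper lexicon, tokenisation, and term weighting.

from collections import Counter
import math

# Hypothetical English -> Hindi dictionary; not provided by the task.
LEXICON = {}

def to_source_terms(english_text):
    # Translate target-side tokens into source-language terms,
    # dropping words the lexicon does not cover.
    return [LEXICON[w] for w in english_text.lower().split() if w in LEXICON]

def cosine(terms_a, terms_b):
    # Cosine similarity between two bags of terms.
    ca, cb = Counter(terms_a), Counter(terms_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def rank_sources(target_text, sources, k=100):
    # Score every source story against one translated target; keep the top k.
    query = to_source_terms(target_text)
    scored = [(sid, cosine(query, text.split())) for sid, text in sources.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]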

Data

The source collection is in Hindi and the target collection is in English. All documents are plain text with markup for meta-information such as the news title and publication date.

Evaluation Phase

Let T be a set of target news stories and S a set of potential source news stories. The task is to find and link the news stories s in S that describe the same news event as a corresponding news story t in T, and in particular to link each t to its corresponding s where t and s also share the same focal event.

The corpus contains a set of potential source news stories S, written in Hindi, and a set of target news stories T, written in English. The corpus consists of plain text files encoded in UTF-8. The documents contain meta-information, including the news title and publication date.
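
As an illustration, here is a minimal sketch of loading the two collections, assuming one UTF-8 plain-text file per story named with its document id (as in the run-file example further below); the exact directory layout and metadata markup are defined by the corpus release.

import os

def load_collection(directory):
    # Read every UTF-8 plain-text news story in a directory into a dict
    # keyed by file name, which is the document id used in run files.
    stories = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name), encoding="utf-8") as f:
                stories[name] = f.read()
    return stories

# Hypothetical directory names; substitute the paths of the corpus release.
sources = load_collection("source-hindi")    # 50,691 stories
targets = load_collection("target-english")  # 50 (training) or 25 (test)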

Test Collection

The test collection is composed in the same way as the training collection: a set of target documents together with potential source documents.

Training Corpus Statistics:
  • 50,691 source news stories in Hindi
  • 50 target news stories in English
Test Corpus Statistics:
  • 50,691 source news stories in Hindi
  • 25 target news stories in English

Output

Participants can use the training data to build their systems, tune parameters, run proofs of concept, perform manual analysis, and so on. Although participants may analyse their runs on the test data, they should not tune their systems on it: doing so is both unreliable and unethical in a shared-task environment.

Participants are allowed to submit up to three runs per language pair in order to experiment with different settings. Participants are also welcome to submit up to three additional runs beyond the competition runs; these will be treated as extra runs and must be uploaded separately from the competition runs. Extra runs will not be considered in the ranking of systems but may be included on the Evaluation page of the website. Please follow the naming convention for such extra runs.

The results of your detection must be formatted as <target-docid> Q0 <source-docid> <rank> <similarity>, where
  • <target-docid> is the id of the corresponding English target document
  • Q0 is an unused parameter (use it as is)
  • <source-docid> is the id of the corresponding source document in the respective language
  • <rank> is the rank your system assigns to the source document for the given target document (must be an integer)
  • <similarity> is the similarity score your system assigns to the source document for the given target document (must be a double)
  • All fields are delimited by a single <space>.
  • Participants are required to submit a ranked list of up to 100 source news stories for each target news story.
  • The run file must be named in the format run-<1/2/3>-english-<hindi/gujarati>-<teamname>.txt
For example, a standard run file will look like:
english-document-00001.txt Q0 hindi-document-00345.txt 1 0.4644
english-document-00001.txt Q0 hindi-document-42325.txt 2 0.2823
...
english-document-00050.txt Q0 hindi-document-23443.txt 100 0.1123

For example, a run file might be named run-1-english-hindi-upv.txt and, in the case of extra runs, extra-run-1-english-hindi-upv.txt.
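
As a sketch, the required run format can be produced as follows; the rankings structure and write_run are illustrative, and only the output lines themselves are prescribed by the task.

def write_run(path, rankings):
    # One line per (target, source) pair, space-delimited:
    # <target-docid> Q0 <source-docid> <rank> <similarity>
    # rankings maps a target docid to (source docid, score) pairs,
    # already sorted by descending similarity; at most 100 are kept.
    with open(path, "w", encoding="utf-8") as out:
        for target_id, scored in rankings.items():
            for rank, (source_id, score) in enumerate(scored[:100], start=1):
                out.write(f"{target_id} Q0 {source_id} {rank} {score:.4f}\n")

rankings = {"english-document-00001.txt":
            [("hindi-document-00345.txt", 0.4644),
             ("hindi-document-42325.txt", 0.2823)]}
write_run("run-1-english-hindi-upv.txt", rankings)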

Evaluation

The success of the cross-language news story discovery will be measured in terms of NDCG@1, NDCG@10, and NDCG@20, with a computation sketch after the list below. The relevance level of the source news stories for the given test queries will be in {2, 1, 0}, where:

  • 2 = "same news event + same focal event"
  • 1 = "same news event + different focal event" and
  • 0 = "different news event"
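
For reference, here is a minimal sketch of computing NDCG@k from these graded labels, assuming the common formulation with gain 2^rel - 1 and a log2 rank discount; the official scorer may differ in detail.

import math

def dcg(rels, k):
    # Discounted cumulative gain over the top-k relevance labels, in rank order.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg(rels, k, all_rels=None):
    # Normalise by the DCG of the ideal ordering; pass all judged relevance
    # labels in all_rels for a proper ideal, otherwise the retrieved list is used.
    ideal = dcg(sorted(all_rels or rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# Labels of the top-ranked sources for one target, in rank order:
# 2 = same news + same focal event, 1 = same news event only, 0 = different.
print(ndcg([2, 0, 1, 0, 0], k=10))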

Submission

Submission is closed.

Task Committee