Cross-Language Indian Text Reuse 2011

Introduction

With the advent of the World Wide Web, information in many different formats is easily accessible. Texts, images, videos, and audio recordings are all available for consultation, download, and modification. Under these circumstances, text re-use has increased in recent years. In particular, plagiarism has been defined by IEEE as the reuse of someone else's prior ideas, processes, results, or words without explicitly acknowledging the original author and source. The problem has attracted attention from many research areas, even generating new terms, such as the so-called copy&paste syndrome and a new kind of text re-use: cyberplagiarism.

While people have enough expertise to detect re-use of text when reading a document, the scale of potential source documents (that of the Web) makes manual analysis infeasible. As a countermeasure, different systems that assist in the detection of text re-use have been developed. The main idea is to automatically flag those text fragments in a document that are suspected of being re-used and, if available, provide their presumed source. In that way, on the basis of the given linguistic evidence, a human can make the final decision.

Recent efforts have been devoted to developing better models for the detection of text re-use. Probably one of the most interesting cases is PAN, the International Competition on Plagiarism Detection, held in conjunction with CLEF. A special kind of phenomenon is cross-language text re-use, where the re-used text fragment and its source are written in different languages, making its automatic detection even harder than in the monolingual case. Cross-language text re-use detection has only recently been approached, and better models are necessary. With the current initiative we aim to further promote the development of better models for text re-use detection and, in particular, cross-language text re-use detection. Our interest in the latter is motivated by the following facts:

  • Speakers of less-resourced languages (also known as under-resourced languages) are often forced to consult documentation in a foreign language; and
  • People immersed in a foreign country can still consult material written in their native language.

Such environments make cross-language text re-use more likely and turn it into a particularly relevant problem nowadays.

Task

The focus of the CL!TR evaluation task is on cross-language text re-use detection. In this year's task, we target the language pair English–Hindi: the source texts are in English and the suspicious texts are in Hindi. You are provided with a set of suspicious documents in Hindi and a set of potential source documents in English. The task is to identify those documents in the suspicious set (Hindi) that were created by re-using fragments from the source set (English), together with their corresponding sources. Note that this is a document-level task: no specific fragments inside the documents are expected to be identified, only pairs of documents (see the sketch at the end of this section). Determining whether a text has been re-used from its corresponding source is enough; specifying the kind of re-use (Exact, Heavy, or Light) is not necessary.

CL!TR is divided into two phases: training and test. For the training phase we provide an annotated corpus covering different levels of re-use. It includes information about whether a text fragment has been re-used and, if so, what its source is. In the test phase, no annotation or hints about the cases are provided.
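
To make the expected output concrete, the following Python sketch shows the shape of a naive document-level baseline. Here translate_to_english stands in for any Hindi-to-English translation step (an MT system or a bilingual dictionary); neither it, the similarity measure, nor the threshold are part of the task, they are purely hypothetical.

# Hypothetical sketch (Python) of a document-level detector:
# pair each suspicious Hindi document with its most similar English source.

def jaccard(a, b):
    # word-overlap similarity between two texts
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def detect_pairs(suspicious, sources, translate_to_english, threshold=0.3):
    # suspicious, sources: dicts mapping file name -> document text
    pairs = []
    for s_name, s_text in suspicious.items():
        s_en = translate_to_english(s_text)  # cross into the source language
        scores = {d: jaccard(s_en, d_text) for d, d_text in sources.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            pairs.append((s_name, best))     # document-level pair, no fragments
    return pairs

Note that the output is exactly what the task asks for: a list of (suspicious, source) file-name pairs, with no fragment offsets and no re-use category.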


Data

Let S be a set of suspicious documents and D a set of potential source documents. The task is to find those documents s in S which have actually been re-used, together with their source documents d in D. The corpus consists of the set of potential source documents D, written in English, and the set of suspicious documents S, written in Hindi. All files are plain text encoded in UTF-8. The source documents are taken from the English Wikipedia and still include Wiki markup.
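
Since the files are plain UTF-8 text and the English sources still carry Wiki markup, a first preprocessing step might look like the following Python sketch. The directory names and the markup patterns are assumptions for illustration, not part of the official corpus layout.

# Minimal sketch (Python): load the corpus and strip some common wiki markup.
import os
import re

def load_documents(directory):
    docs = {}
    for name in os.listdir(directory):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            docs[name] = f.read()
    return docs

def strip_wiki_markup(text):
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)                     # templates {{...}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # links [[target|label]]
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quotes
    text = re.sub(r"={2,}", " ", text)                              # heading markers
    return text

# Hypothetical directory names:
sources = {n: strip_wiki_markup(t) for n, t in load_documents("source-en").items()}
suspicious = load_documents("suspicious-hi")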

Training Collection

In order to prepare and develop your detection software, we provide a training collection. This collection includes annotations for every case of re-use.

  • 5032 source files in English
  • 198 suspicious files in Hindi

Test Collection

The test collection is composed in the same way as the training collection: a set of suspicious documents together with potential source documents.

  • 5032 source files in English
  • 190 suspicious files in Hindi

Submission of Detection Results

Participants are allowed to submit up to three runs in order to experiment with different settings. The results of your detection are required to be formatted in XML. The result document must be valid with respect to the following schema:

<document>
<reuse_case
  reused_reference="..."    <!-- file name of the suspicious document -->
  source_reference="..."    <!-- file name of the source document -->
/>
<reuse_case
  reused_reference="..."    <!-- file name of the suspicious document -->
  source_reference="..."    <!-- file name of the source document -->
/>
.........................    <!-- more detections in the collection -->
</document>

For each pair of suspicious and source documents there is exactly one <reuse_case .../> entry in the XML file.
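
As an illustration, a file in this format could be produced with a few lines of Python; the function below is a minimal sketch, and the file names in the usage comment are hypothetical.

# Minimal sketch (Python): serialize detected pairs into the required XML format.
from xml.sax.saxutils import quoteattr

def write_submission(pairs, path):
    # pairs: iterable of (suspicious_file_name, source_file_name)
    with open(path, "w", encoding="utf-8") as out:
        out.write("<document>\n")
        for reused, source in pairs:
            out.write("<reuse_case\n")
            out.write("  reused_reference=%s\n" % quoteattr(reused))
            out.write("  source_reference=%s\n" % quoteattr(source))
            out.write("/>\n")
        out.write("</document>\n")

# write_submission([("0007.txt", "1832.txt")], "run1.xml")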

Evaluation

The success of a text re-use detection system will be measured in terms of its Precision (P), Recall (R), and F-measure (F) in detecting the re-used documents together with their sources in the test corpus. A detection is considered correct if the re-used document is identified together with its corresponding source document. We consider:

  • total detected to be the set of suspicious-source pairs detected by the system;
  • correctly detected to be the subset of detected pairs which actually compose cases of re-use;
  • total re-used to be the gold standard, which includes all pairs that compose actual cases of re-use.
P, R and F are defined as follows:

$$ \text{P} = {\text{correctly detected} \over \text{total detected}} $$

$$ \text{R} = {\text{correctly detected} \over \text{total re-used}} $$

$$ \text{F-measure} = { {2 \cdot \text{P} \cdot \text{R}} \over {\text{P} + \text{R}} } $$

A reference implementation of the measures, coded in Perl, is no longer available.
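
Since the Perl reference implementation is no longer available, the following Python sketch shows how the measures above can be computed; the pair sets in the usage comment are made up for illustration.

# Minimal sketch (Python) of the evaluation measures defined above.
# detected and gold are sets of (suspicious_file, source_file) pairs.

def evaluate(detected, gold):
    correctly_detected = len(detected & gold)  # detected pairs that are real re-use cases
    p = correctly_detected / len(detected) if detected else 0.0
    r = correctly_detected / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# Example (hypothetical file names):
# detected = {("0007.txt", "1832.txt"), ("0011.txt", "0045.txt")}
# gold     = {("0007.txt", "1832.txt")}
# evaluate(detected, gold)  ->  (0.5, 1.0, 0.666...)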

Results

Participants

Participant               Institution                                       Country
Aniruddha Ghosh           Jadavpur University                               India
Karteek Addanki et al.    Hong Kong University of Science and Technology    Hong Kong (China)
Nitish Aggarwal et al.    DERI Galway and UPM Madrid                        Ireland / Spain
Parth Gupta et al.        UPV & DA-IICT                                     Spain / India
Rambhoopal K.             IIIT Hyderabad                                    India
Yurii Palkovskii          Zhytomyr State University / SkyLine Inc.          Ukraine

Ranking

Rank  F-measure  Recall  Precision  Run  Leader
1     0.649      0.750   0.571      3    Rambhoopal K.
2     0.609      0.821   0.484      1    Nitish Aggarwal
3     0.608      0.643   0.576      2    Rambhoopal K.
4     0.603      0.589   0.617      1    Yurii Palkovskii
5     0.596      0.804   0.474      2    Parth Gupta
6     0.589      0.795   0.468      2    Nitish Aggarwal
7     0.576      0.589   0.564      1    Rambhoopal K.
8     0.541      0.473   0.631      2    Yurii Palkovskii
9     0.523      0.500   0.549      3    Yurii Palkovskii
10    0.509      0.607   0.439      3    Parth Gupta
11    0.430      0.580   0.342      1    Parth Gupta
12    0.220      0.214   0.226      2    Aniruddha Ghosh
13    0.220      0.214   0.226      3    Aniruddha Ghosh
14    0.085      0.107   0.070      1    Aniruddha Ghosh
15    0.000      0.000   0.000      1    Karteek Addanki

Task Committee