Synopsis

  • Task: Given a pair of documents, identify all contiguous maximal-length passages of reused text between them.
  • Input: [data]
  • Evaluation: [code]
  • Baseline: [code]

Input

To develop your software, we provide you with a training corpus that consists of pairs of documents, one of which may contain passages of text reused from the other. The reused text is subject to various kinds of (automatic) obfuscation to hide the fact that it has been reused.

Output

The evaluation corpora contain a file named pairs, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml that contains meta information about the plagiarism cases detected within:

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

For example, the file above specifies an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt that is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
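The format above can be produced with Python's standard library. The following is a minimal sketch, not part of the official tooling: the function name build_detection_xml and the tuple layout of detections are assumptions made for illustration, while the element and attribute names are taken from the example above.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(susp_name, src_name, detections):
    """Serialize detections for one document pair.

    `detections` is an iterable of hypothetical 4-tuples:
    (this_offset, this_length, source_offset, source_length),
    all measured in characters.
    """
    root = ET.Element("document", {"reference": susp_name})
    for this_offset, this_length, source_offset, source_length in detections:
        # One <feature> element per detected passage.
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": src_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_detection_xml(
        "suspicious-documentXYZ.txt",
        "source-documentABC.txt",
        [(5, 1000, 100, 1000)],
    ))
```

Attribute values are written as strings because XML attributes carry no type information; a consumer is expected to parse the offsets and lengths back into integers.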

Evaluation

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.
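To illustrate how the three measures combine, here is a small sketch that assumes the definition commonly given for plagdet in the PAN evaluation framework: the harmonic mean of precision and recall, divided by log2(1 + granularity). The function name and argument handling are illustrative; the reference implementation remains authoritative.

```python
import math


def plagdet(precision, recall, granularity):
    """Combine precision, recall, and granularity into one score.

    Assumed definition: F1(precision, recall) / log2(1 + granularity),
    where granularity >= 1 and a value of 1 means that every plagiarism
    case was reported as exactly one detection.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    print(plagdet(0.8, 0.7, 1.1))
```

Under this definition, a detector that reports each case in several fragments is penalized even when its precision and recall are perfect, since the denominator grows with granularity.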

Baseline

For your convenience, we provide a baseline program written in Python. The program loops through the document pairs of a corpus and records the detection results in XML files. The XML files are valid with respect to the output format described above. You may use the source code as a starting point for your own approach.
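The loop over document pairs might look roughly like the sketch below. It is not the baseline itself: the helper names read_pairs and output_name are hypothetical, and the layout of the pairs file (one whitespace-separated pair of file names per line, suspicious document first) is an assumption that should be checked against the actual corpus.

```python
import os


def read_pairs(lines):
    """Yield (suspicious, source) file names from the lines of a pairs file.

    Assumes one whitespace-separated pair per line; lines that do not
    contain exactly two fields are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield parts[0], parts[1]


def output_name(susp_name, src_name):
    """Derive the XML file name for a pair by joining both file stems."""
    susp_stem = os.path.splitext(susp_name)[0]
    src_stem = os.path.splitext(src_name)[0]
    return f"{susp_stem}-{src_stem}.xml"


if __name__ == "__main__":
    sample = ["suspicious-documentXYZ.txt source-documentABC.txt\n"]
    for susp, src in read_pairs(sample):
        # A real detector would compare the two documents here and
        # write its detections to the file named below.
        print(output_name(susp, src))
```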

Results

Text Alignment Performance

Plagdet    Team
0.7386159  L. Kong°, H. Qi°, S. Wang°, C. Du*, S. Wang*, and Y. Han°
           °Heilongjiang Institute of Technology and *Harbin Engineering University, China
0.6826726  Š. Suchomel, J. Kasprzak, and M. Brandejs
           Masaryk University, Czech Republic
0.6787810  Cristian Grozea° and Marius Popescu*
           °Fraunhofer FOKUS, Germany, and *University of Bucharest, Romania
0.6735574  Oberreuter et al.
           Universidad de Chile, Chile
0.6252024  D.A. Rodríguez Torrejón and J.M. Martín Ramos
           Universidad de Huelva, Spain
0.5382163  Y. Palkovskii and A. Belov
           Zhytomyr State University, Ukraine
0.3499632  R. Küppers and S. Conrad
           University of Düsseldorf, Germany
0.3099853  F. Sánchez-Vega, M. Montes-y-Gómez, and L. Villaseñor-Pineda
           Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
0.3088109  L. Gillam, N. Newbold, and N. Cooke
           University of Surrey, UK
0.0452519  A. Jayapal
           The University of Sheffield, UK

A more detailed analysis of the detection performance can be found in the overview paper accompanying this task.

Task Committee