Text Alignment 2014
Synopsis
- Task: Given a pair of documents, your task is to identify all contiguous maximal-length passages of reused text between them.
- Input: [data][supplement data]
- Evaluation: [code]
- Baseline: [code]
Input
To develop your software, we provide you with a training corpus that consists of pairs of documents, one of which may contain passages of text reused from the other. The reused text is subject to various kinds of (automatic) obfuscation to hide the fact that it has been reused. Learn more »
Output
Enclosed in the evaluation corpora, a file named `pairs` is found, which lists all pairs of suspicious documents and source documents to be compared. For each pair `suspicious-documentXYZ.txt` and `source-documentABC.txt`, your plagiarism detector shall output an XML file `suspicious-documentXYZ-source-documentABC.xml` which contains meta information about the plagiarism cases detected within:
```xml
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100" source_length="1000" />
  <feature ... />
  ...
</document>
```
For example, the above file specifies an aligned passage of text between `suspicious-documentXYZ.txt` and `source-documentABC.txt` that is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
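The output format above can be produced with the standard library. The following is a minimal sketch; the function name `write_detections` and the tuple-based intermediate representation of detections are assumptions for illustration, not part of the task specification:

```python
import xml.etree.ElementTree as ET

def write_detections(suspicious, source, detections, out_path):
    """Write detected passages in the task's XML output format.

    `detections` is a list of tuples
    (this_offset, this_length, source_offset, source_length),
    one per aligned passage.
    """
    root = ET.Element("document", reference=suspicious)
    for this_off, this_len, src_off, src_len in detections:
        ET.SubElement(root, "feature",
                      name="detected-plagiarism",
                      this_offset=str(this_off),
                      this_length=str(this_len),
                      source_reference=source,
                      source_offset=str(src_off),
                      source_length=str(src_len))
    ET.ElementTree(root).write(out_path, encoding="utf-8")
```

Called as `write_detections("suspicious-documentXYZ.txt", "source-documentABC.txt", [(5, 1000, 100, 1000)], "suspicious-documentXYZ-source-documentABC.xml")`, this reproduces the example file shown above.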
Evaluation
Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python. Learn more »
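How the three measures combine can be sketched as follows, assuming the plagdet definition published in the PAN overview papers (the F1 of precision and recall, discounted by the logarithm of granularity); for the authoritative computation, use the provided reference implementation:

```python
import math

def plagdet(precision, recall, granularity):
    """Combine precision, recall, and granularity into a single score.

    Granularity is >= 1 and measures how often a single plagiarism
    case is reported in fragments; a perfect detector has
    granularity 1, so the discount log2(1 + granularity) is 1 and
    plagdet reduces to plain F1.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

For instance, a detector with precision and recall of 1.0 but granularity 2 (every case split in two) scores only about 0.63 rather than 1.0.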
Baseline
For your convenience, we provide a baseline program written in Python. The program loops through the document pairs of a corpus and records the detection results in XML files. The XML files are valid with respect to the output format described above. You may use the source code as a starting point for your own approach.
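The loop the baseline performs can be sketched like this. This is a hypothetical skeleton, not the provided baseline itself; it assumes the `pairs` file format and output file naming described in the Output section, and it emits empty detection files where a real detector would record its findings:

```python
import os
import xml.etree.ElementTree as ET

def run_baseline(corpus_dir, out_dir):
    """Loop over the document pairs of a corpus and write one
    detection file per pair, in the task's XML output format."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(corpus_dir, "pairs")) as f:
        for line in f:
            suspicious, source = line.split()
            root = ET.Element("document", reference=suspicious)
            # A real detector would compare the two documents here
            # and append one <feature> element per aligned passage.
            name = "%s-%s.xml" % (suspicious[:-4], source[:-4])
            ET.ElementTree(root).write(os.path.join(out_dir, name),
                                       encoding="utf-8")
```

Each output file is named by joining the two document names without their `.txt` extensions, matching the naming scheme required by the evaluation.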
Results
| Plagdet | Team |
|---|---|
| 0.87818 | Miguel A. Sanchez-Perez, Grigori Sidorov, and Alexander Gelbukh, Instituto Politécnico Nacional, Mexico |
| 0.86933 | Gabriel Oberreuter and Andreas Eiselt, Innovand.io, Chile |
| 0.86806 | Yurii Palkovskii and Alexei Belov, Zhytomyr Ivan Franko State University, Ukraine |
| 0.85930 | Demetrios Glinos, University of Central Florida, USA |
| 0.84404 | Prasha Shrestha, Suraj Maharjan, and Thamar Solorio, University of Alabama at Birmingham, USA |
| 0.82952 | Diego Antonio Rodríguez Torrejón and José Manuel Martín Ramos, Universidad de Huelva, Spain |
| 0.82642 | Philipp Gross and Pashutan Modaresi, pressrelations GmbH, Germany |
| 0.82161 | Leilei Kong, Yong Han, Zhongyuan Han, Haihao Yu, Qibo Wang, Tinglei Zhang, and Haoliang Qi, Heilongjiang Institute of Technology, China |
| 0.67220 | Samira Abnar, Mostafa Dehghani, Hamed Zamani, and Azadeh Shakery, University of Tehran, Iran |
| 0.65954 | Faisal Alvi°, Mark Stevenson*, and Paul Clough*, °King Fahd University of Petroleum & Minerals, Saudi Arabia, and *University of Sheffield, UK |
| 0.42191 | Baseline |
| 0.28302 | Lee Gillam and Scott Notley, University of Surrey, UK |
A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.
Related Work
- Plagiarism Detection, PAN @ CLEF'13
- Plagiarism Detection, PAN @ CLEF'12
- Plagiarism Detection, PAN @ CLEF'11
- Plagiarism Detection, PAN @ CLEF'10
- Plagiarism Detection, PAN @ SEPLN'09