PAN 2019 now live!

Plagiarism Detection

PAN @ Online 2019

This task is divided into source retrieval and text alignment. Only the text alignment task is available as of today.

Source Retrieval

At this stage, we will focus on the text alignment task alone. Stay tuned, as we will try to have source retrieval up soon!

If you are interested in source retrieval, consult the proceedings and overview of the CLEF 2014 edition of the task (and of the former editions, going back to 2009).


Text Alignment

Task

Given a pair of documents, your task is to identify all contiguous maximal-length passages of reused text between them.
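The task itself does not prescribe an algorithm, but one common way to start is a seed-and-extend strategy: first collect positions where the two documents share an identical character n-gram, then merge and extend those seeds into maximal passages. The following is only a minimal sketch of the seeding step (the function name and the choice of n = 8 are illustrative, not part of the task):

```python
def seeds(susp, src, n=8):
    """Return (susp_offset, src_offset) pairs where the suspicious and
    source texts share an identical character n-gram (a naive seeding
    step; merging seeds into maximal passages is left to the detector)."""
    # Index every n-gram of the source document by its offsets.
    index = {}
    for i in range(len(src) - n + 1):
        index.setdefault(src[i:i + n], []).append(i)
    # Look up every n-gram of the suspicious document in that index.
    matches = []
    for j in range(len(susp) - n + 1):
        for i in index.get(susp[j:j + n], []):
            matches.append((j, i))
    return matches
```

Real detectors typically filter out frequent n-grams and join adjacent seeds before reporting passages; this sketch only shows where candidate alignments come from.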

Training Corpus

To develop your software, we provide you with a training corpus that consists of pairs of documents, one of which may contain passages of text reused from the other. The reused text is subject to various kinds of (automatic) obfuscation to hide the fact that it has been reused.

Learn more » Download corpus

Baseline

For your convenience, we provide a baseline program written in Python.

Download program

The program loops through the document pairs of a corpus and records the detection results in XML files. The XML files are valid with respect to the output format described below. You may use the source code for getting started with your own approach.
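A loop of the kind the baseline performs might look roughly as follows. This is a sketch under assumptions: the susp/ and src/ subdirectory layout and the whitespace-separated pairs format are taken from common PAN corpus conventions, and detect is a placeholder for your own detection routine:

```python
import os

def run_detector(corpus_dir, detect):
    """Loop over the document pairs listed in the corpus's `pairs` file
    and call `detect` on each pair's text. The susp/ and src/ subfolders
    are an assumed corpus layout; `detect` is your own routine."""
    results = {}
    with open(os.path.join(corpus_dir, "pairs")) as f:
        for line in f:
            susp_name, src_name = line.split()
            with open(os.path.join(corpus_dir, "susp", susp_name)) as s:
                susp = s.read()
            with open(os.path.join(corpus_dir, "src", src_name)) as t:
                src = t.read()
            results[(susp_name, src_name)] = detect(susp, src)
    return results
```

The actual baseline additionally serializes each pair's detections to an XML file in the output format described below.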

Output

Each evaluation corpus contains a file named pairs, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml which contains meta information about the plagiarism cases detected within:

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt, and that it is of length 1000 characters, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
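Such a file can be produced with Python's standard xml.etree.ElementTree module; a minimal sketch (the function name and the tuple layout of cases are illustrative):

```python
import xml.etree.ElementTree as ET

def write_detections(susp_name, src_name, cases):
    """Serialize detected passages to the XML output format above.
    `cases` holds tuples (this_offset, this_length,
    source_offset, source_length); offsets are in characters."""
    root = ET.Element("document", reference=susp_name)
    for t_off, t_len, s_off, s_len in cases:
        ET.SubElement(root, "feature",
                      name="detected-plagiarism",
                      this_offset=str(t_off), this_length=str(t_len),
                      source_reference=src_name,
                      source_offset=str(s_off), source_length=str(s_len))
    return ET.tostring(root, encoding="unicode")
```

Writing the returned string to suspicious-documentXYZ-source-documentABC.xml yields a file equivalent to the example above.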

Performance Measures

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.
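For intuition, the plagdet score as described in the PAN overview papers is the F1 of precision and recall, discounted by the logarithm of granularity; a compact restatement (consult the reference implementation for the authoritative definition):

```python
import math

def plagdet(precision, recall, granularity):
    """Combine precision, recall, and granularity into the plagdet
    score: F1 of precision and recall, divided by log2(1 + granularity).
    A granularity of 1 (each case detected in one piece) leaves F1
    unchanged; fragmented detections are penalized."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

Note how a detector that reports each plagiarism case in several fragments (granularity > 1) scores lower than one that reports each case as a single passage, even at equal precision and recall.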

Learn more » Download measures

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus is released including the ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus 1 Download corpus 2 Download corpus 3

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/corpus -o path/to/output/directory
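In Python, this interface can be handled with the standard argparse module; a minimal sketch (the destination attribute names corpus and output are our own choice):

```python
import argparse

def parse_args(argv=None):
    """Parse the required command-line interface:
    -i path/to/corpus  -o path/to/output/directory"""
    p = argparse.ArgumentParser(description="PAN text alignment detector")
    p.add_argument("-i", dest="corpus", required=True,
                   help="path to the corpus to process")
    p.add_argument("-o", dest="output", required=True,
                   help="directory to write the XML detections to")
    return p.parse_args(argv)
```

Keeping exactly this call signature matters, because your software will be invoked automatically during evaluation.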

You may freely choose among the available programming languages and between the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below.

PAN Virtual Machine User Guide »

Once your software is deployed in your virtual machine, you can proceed to submit it. Before doing so, we provide you with a software submission readiness tester. Please use this tester to verify that your software works. Since we will call your software automatically in much the same way as the tester does, this lowers the risk of errors.

Download PAN Software Submission Readiness Tester

When your software is submission-ready, please mail the filled-in submission.txt file found alongside the software submission readiness tester to pan@webis.de.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

The text alignment task has been run since PAN'09; here is a quick list of the respective proceedings and overviews: