Text Alignment 2015 - Corpus Construction


  • Task: Construct a corpus of text reuse.
  • Evaluation: [example corpus]


This task may be solved in two alternative ways:

Collection: Find real-world instances of text reuse or plagiarism, and annotate them.

Generation: Given pairs of documents, generate passages of reused or plagiarized text between them. Apply a means of obfuscation of your choosing.


We ask you to prepare your corpus so that its format corresponds to that of the previous PAN plagiarism corpora. The datasets from previous years serve as examples of correctly formatted datasets.

The evaluation corpora include a file named pairs, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml which contains meta information about the plagiarism cases detected within:

<document reference="suspicious-documentXYZ.txt">
<feature ... />
</document>

For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt that is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
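The file naming rule and the annotation described above can be sketched in Python as follows. This is a minimal illustration, not the official tooling: the helper names (parse_pairs, annotation_filename, build_annotation) are hypothetical, the pairs file is assumed to hold one whitespace-separated pair per line, and the feature attribute names (name, this_offset, this_length, source_reference, source_offset, source_length) are assumed from the format of earlier PAN corpora, since the example above elides them. Check them against a dataset from a previous year before relying on them.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_pairs(text):
    """Parse the contents of a pairs file into (suspicious, source) tuples.

    Assumption: one pair per line, file names separated by whitespace.
    """
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


def annotation_filename(suspicious, source):
    """Derive the XML file name for one pair of documents."""
    return f"{Path(suspicious).stem}-{Path(source).stem}.xml"


def build_annotation(suspicious, source, this_offset, this_length,
                     source_offset, source_length):
    """Return an XML string describing one aligned passage.

    Assumption: attribute names follow earlier PAN corpora.
    """
    root = ET.Element("document", {"reference": suspicious})
    ET.SubElement(root, "feature", {
        "name": "detected-plagiarism",  # assumed value
        "this_offset": str(this_offset),
        "this_length": str(this_length),
        "source_reference": source,
        "source_offset": str(source_offset),
        "source_length": str(source_length),
    })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    susp, src = "suspicious-documentXYZ.txt", "source-documentABC.txt"
    print(annotation_filename(susp, src))
    # Passage of 1000 characters at offset 5 (suspicious) and 100 (source).
    print(build_annotation(susp, src, 5, 1000, 100, 1000))
```

Offsets and lengths are counted in characters, matching the description above, so documents should be read with a fixed encoding before offsets are computed.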


Performance will be measured by assessing the validity of your corpus in two ways.

Detection: Your corpus will be fed into the text alignment prototypes that were submitted to the text alignment task in previous years. The performance of each prototype in detecting the plagiarism in your corpus will be measured using macro-averaged precision and recall, granularity, and the plagdet score.
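This page does not define the measures. As a rough sketch, assuming the definition commonly used in the PAN evaluation framework, plagdet combines the other three measures as the F1 score of precision and recall divided by log2(1 + granularity). The function names below are illustrative, and the sketch takes precision, recall, and granularity as already computed inputs rather than deriving them from character-level annotations.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """Combine the three measures into one score.

    Assumed definition: F1 / log2(1 + granularity), where granularity
    is at least 1 (1 means each case is detected as a single piece).
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    return f1_score(precision, recall) / math.log2(1 + granularity)


if __name__ == "__main__":
    # With granularity 1 the score equals F1.
    print(plagdet(0.8, 0.8, 1.0))
    # Fragmented detections (granularity 3) halve the score here.
    print(plagdet(0.5, 0.5, 3.0))
```

Under this definition, a detector that reports one case as several fragments is penalized even when its precision and recall are unchanged.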

Peer review: Your corpus will be made available to the other participants of this task and be subject to peer review. Every participant will be given a chance to assess and analyze the corpora of all other participants in order to determine corpus quality.


To submit your corpus, put it in a ZIP archive and make it available to us via a file sharing service of your choosing, e.g., Dropbox or Mega.

Task Committee