Text Alignment 2015 - Corpus Construction

Synopsis

  • Task: Construct a corpus of text reuse.
  • Evaluation: [example corpus]

Task

This task may be solved in two alternative ways:

Collection: Find real-world instances of text reuse or plagiarism, and annotate them.

Generation: Given pairs of documents, generate passages of reused or plagiarized text between them, applying a means of obfuscation of your choosing.
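For the generation approach, an obfuscation strategy can be as simple as locally shuffling and dropping words, or as elaborate as paraphrasing. The sketch below is a minimal, hypothetical example of a cheap obfuscation step; the function name, window size, and deletion rate are our own choices, not part of the task specification.

```python
import random

def obfuscate(passage, seed=0, shuffle_window=4, deletion_rate=0.1):
    """Lightly obfuscate a source passage by shuffling words within
    small windows and randomly dropping a fraction of them.

    This is an illustrative baseline only; real corpora typically use
    stronger obfuscation (synonym substitution, paraphrasing, etc.).
    """
    rng = random.Random(seed)
    words = passage.split()
    # Shuffle words locally so the passage stays roughly aligned
    # with the source while no longer matching it verbatim.
    for i in range(0, len(words), shuffle_window):
        window = words[i:i + shuffle_window]
        rng.shuffle(window)
        words[i:i + shuffle_window] = window
    # Drop a small fraction of words to simulate paraphrasing losses.
    kept = [w for w in words if rng.random() > deletion_rate]
    return " ".join(kept)
```

Whatever strategy you choose, record it in the corpus annotations so that detector performance can later be analyzed per obfuscation type.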

Output

We ask you to prepare your corpus so that its format corresponds to the previous PAN plagiarism corpora. The datasets from previous years serve as examples of correctly formatted datasets.

Enclosed in the evaluation corpora is a file named pairs, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, an XML file suspicious-documentXYZ-source-documentABC.xml shall be provided, containing meta information about the plagiarism cases within:

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

For example, the file above specifies an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt that is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
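Such annotation files can be generated with the standard library. The following sketch writes one XML file per document pair in the format shown above; the function name and the keys of the case dictionaries are our own conventions, chosen to mirror the attribute names of the feature element.

```python
import xml.etree.ElementTree as ET

def write_annotation(path, suspicious, cases):
    """Write a PAN-style XML annotation file for one document pair.

    `cases` is a list of dicts with the keys this_offset, this_length,
    source_reference, source_offset, and source_length, mirroring the
    attributes of the <feature> element in the format above.
    """
    root = ET.Element("document", reference=suspicious)
    for case in cases:
        ET.SubElement(
            root, "feature",
            name="detected-plagiarism",
            this_offset=str(case["this_offset"]),
            this_length=str(case["this_length"]),
            source_reference=case["source_reference"],
            source_offset=str(case["source_offset"]),
            source_length=str(case["source_length"]),
        )
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

Offsets and lengths are measured in characters, so take care to count them on the decoded text rather than on raw bytes.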

Evaluation

Performance will be measured by assessing the validity of your corpus in two ways.

Detection: Your corpus will be fed into the text alignment prototypes that have been submitted to the text alignment task in previous years. The performance of each prototype in detecting the plagiarism in your corpus will be measured using macro-averaged precision and recall, granularity, and the plagdet score.
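For orientation, the plagdet score combines precision, recall, and granularity into a single number: the F1 measure of precision and recall, discounted by the logarithm of the granularity (which is at least 1, and 1 when every case is detected as a single contiguous detection). A minimal sketch:

```python
import math

def plagdet(precision, recall, granularity):
    """Combine macro-averaged precision and recall with granularity
    into the plagdet score: F1 divided by log2(1 + granularity).

    With perfect precision, recall, and granularity 1, the score is 1.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

Note how granularity penalizes detectors (and, indirectly, corpora) whose cases are reported as many fragmented detections rather than one per annotated passage.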

Peer review: Your corpus will be made available to the other participants of this task and subjected to peer review. Every participant will have the chance to assess and analyze the corpora of all other participants in order to determine corpus quality.

Submission

To submit your corpus, put it in a ZIP archive and make it available to us via a file sharing service of your choosing, e.g., Dropbox or Mega.

Task Committee