Text Alignment 2015 - Corpus Construction
Synopsis
- Task: Construct a corpus of text reuse.
- Evaluation: [example corpus]
Task
This task may be solved in one of two ways:
Collection: Find real-world instances of text reuse or plagiarism, and annotate them.
Generation: Given pairs of documents, generate passages of reused or plagiarized text between them. Apply a means of obfuscation of your choosing.
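As an illustration of the generation approach, the sketch below shows a deliberately simple obfuscation strategy: shuffling the word order of a reused passage. This is a hypothetical example of our own, not a strategy prescribed by the task; a real corpus would likely use stronger obfuscation such as paraphrasing, synonym substitution, or translation round-trips.

```python
import random

def obfuscate_shuffle(passage, seed=42):
    """Toy obfuscation: randomly shuffle the words of a reused passage.

    The vocabulary of the source passage is preserved, but the surface
    order is destroyed, so simple exact-match detectors will miss it
    while bag-of-words approaches will still find it.
    """
    rng = random.Random(seed)  # fixed seed for reproducible corpora
    words = passage.split()
    rng.shuffle(words)
    return " ".join(words)

source = "the quick brown fox jumps over the lazy dog"
print(obfuscate_shuffle(source))
```

Keeping the random seed fixed makes the generated corpus reproducible, which simplifies later peer-review of the obfuscation choices.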
Output
We ask you to prepare your corpus so that its format corresponds to that of the previous PAN plagiarism corpora. The datasets from previous years serve as examples of correctly formatted datasets.
Enclosed in the evaluation corpora, a file named pairs is found, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml which contains meta information about the plagiarism cases detected within:
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100" source_length="1000" />
  <feature ... />
  ...
</document>
For example, the above file specifies an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt that is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
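The annotation format above can be produced with standard XML tooling. The sketch below is a minimal example of writing one such file; the helper name and the tuple layout of the cases are our own choices, while the element and attribute names follow the example above.

```python
import xml.etree.ElementTree as ET

def write_detections(susp, src, cases, out_path):
    """Write a PAN-style XML annotation file for one document pair.

    `cases` is a list of (this_offset, this_length, source_offset,
    source_length) tuples, one per aligned passage.
    """
    root = ET.Element("document", reference=susp)
    for this_off, this_len, src_off, src_len in cases:
        ET.SubElement(root, "feature",
                      name="detected-plagiarism",
                      this_offset=str(this_off),
                      this_length=str(this_len),
                      source_reference=src,
                      source_offset=str(src_off),
                      source_length=str(src_len))
    ET.ElementTree(root).write(out_path, encoding="utf-8",
                               xml_declaration=True)

# Reproduce the example from the text: one 1000-character case.
write_detections("suspicious-documentXYZ.txt", "source-documentABC.txt",
                 [(5, 1000, 100, 1000)],
                 "suspicious-documentXYZ-source-documentABC.xml")
```

Generating the files programmatically rather than by hand makes it easy to keep character offsets consistent with the actual document texts.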
Evaluation
Performance will be measured by assessing the validity of your corpus in two ways.
Detection: Your corpus will be fed into the text alignment prototypes that have been submitted in previous years to the text alignment task. The performances of each text alignment prototype in detecting the plagiarism in your corpus will be measured using macro-averaged precision and recall, granularity, and the plagdet score.
Peer-review: Your corpus will be made available to the other participants of this task and be subject to peer-review. Every participant will be given a chance to assess and analyze the corpora of all other participants in order to determine corpus quality.
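For reference, the plagdet score mentioned above combines the other three measures: it is the F1 of precision and recall, discounted by the logarithm of granularity (Potthast et al.), so that detectors reporting each case exactly once (granularity 1) keep their full F1. A minimal sketch:

```python
import math

def plagdet(precision, recall, granularity):
    """plagdet = F1 / log2(1 + granularity).

    Granularity is >= 1 and counts how many detections cover each true
    case on average; granularity 1 leaves F1 unchanged, while higher
    granularity (fragmented detections) lowers the score.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

print(plagdet(0.9, 0.8, 1.0))  # equals F1, since log2(2) = 1
print(plagdet(0.9, 0.8, 2.0))  # same F1, penalized for fragmentation
```

A corpus on which the prototypes achieve sensible plagdet scores is one piece of evidence that its annotations are valid.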
Submission
To submit your corpus, put it in a ZIP archive and make it available to us via a file sharing service of your choosing, e.g., Dropbox or Mega.
Related Work
- Plagiarism Detection, PAN @ CLEF'14
- Plagiarism Detection, PAN @ CLEF'13
- Plagiarism Detection, PAN @ CLEF'12
- Plagiarism Detection, PAN @ CLEF'11
- Plagiarism Detection, PAN @ CLEF'10
- Plagiarism Detection, PAN @ SEPLN'09