Generative Plagiarism Detection
Synopsis
- Task: Given a pair of documents, your task is to identify all contiguous maximal-length passages of reused text between them.
- Input: [data].
- Evaluation: [code].
- Submission: Deployment on TIRA [submit]
Task Overview
To develop your software, we provide you with a training and validation corpus that consists of pairs of documents, one of which may contain passages of text reused from the other. The reused text has been automatically paraphrased by LLMs to hide the fact that it has been reused. Multiple LLMs have been used, and the documents may also contain genuine LLM-paraphrased text that is not reused. The input and output formats are the same as in previous text-alignment tasks.
Data
The dataset is available via Zenodo. Please register first at TIRA. The dataset contains copyrighted material and may be used only for research purposes. Redistribution is not allowed.
The training and validation corpora each contain two folders: (1) the text data and (2) the annotation data (with the _truths postfix).
- Text Data: contains a pairs file which lists all pairs of suspicious documents (in the susp folder) and source documents (in the src folder) to be compared.
- Annotation Data: contains XML files for each pair in the pairs file, providing information about the locations and sources of reused text.
<document reference="suspicious-documentXYZ.txt">
  <feature name="plagiarism" this_offset="5" this_length="1000" source_reference="source-documentABC.txt" source_offset="100" source_length="1000" ... />
  <feature name="altered" this_offset="5" this_length="1000" source_reference="source-documentABC.txt" ... />
  ...
</document>
The plagiarism feature specifies an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt: the passage is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document. The other attributes allow a more detailed analysis of the results and can be ignored for training.
The altered feature specifies the location of paraphrased text that was not reused (no plagiarism). This makes it possible to distinguish between genuine LLM-generated text and reused text. For the evaluation, only the plagiarism features need to be predicted.
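As a minimal sketch of reading the annotation data, the following Python snippet parses an annotation file and keeps only the plagiarism features. The sample string, the function names, and the offset values given to the altered feature are illustrative assumptions and not part of the official tooling. Offsets are assumed to index characters of the decoded document text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the annotation format described above.
# The values of the "altered" feature are made up for illustration.
SAMPLE = """<document reference="suspicious-documentXYZ.txt">
  <feature name="plagiarism" this_offset="5" this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100" source_length="1000" />
  <feature name="altered" this_offset="2000" this_length="300" />
</document>"""


def read_plagiarism_features(xml_text):
    """Return one dict per `plagiarism` feature; `altered` features are skipped."""
    root = ET.fromstring(xml_text)
    cases = []
    for feature in root.iter("feature"):
        if feature.get("name") != "plagiarism":
            continue
        cases.append({
            "source_reference": feature.get("source_reference"),
            "this_offset": int(feature.get("this_offset")),
            "this_length": int(feature.get("this_length")),
            "source_offset": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return cases


def extract_passage(text, offset, length):
    """Slice a passage out of a document given a character offset and length."""
    return text[offset:offset + length]


cases = read_plagiarism_features(SAMPLE)
```

With the sample above, cases holds a single entry for the plagiarism feature. Calling extract_passage on the suspicious text with this_offset and this_length, and on the source text with source_offset and source_length, yields the two aligned passages.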
For each pair suspicious-documentXYZ.txt and source-documentABC.txt in the pairs file, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml which specifies the locations of the plagiarism cases detected within the pair. Each feature should be named detected-plagiarism and specify the offsets and lengths in the suspicious and the source document. No other attributes are evaluated. For example:
<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism" this_offset="5" this_length="1000" source_reference="source-documentABC.txt" source_offset="100" source_length="1000" />
  <feature ... />
  ...
</document>
For evaluation, the offset and length attributes of the detected-plagiarism features will be compared against the plagiarism features in the annotation data. No other information will be evaluated.
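The output format can be produced with a few lines of Python. The sketch below is one possible way to do it, not an official writer: the function names and the tuple layout of a detection are assumptions made for illustration, while the element and attribute names follow the example above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def output_name(susp_name, src_name):
    """Build the output file name from the two document names of a pair."""
    return f"{Path(susp_name).stem}-{Path(src_name).stem}.xml"


def build_detection_xml(susp_name, src_name, detections):
    """Serialize detections as `detected-plagiarism` features.

    `detections` is assumed to be an iterable of tuples
    (this_offset, this_length, source_offset, source_length).
    """
    root = ET.Element("document", {"reference": susp_name})
    for this_offset, this_length, source_offset, source_length in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": src_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    return ET.tostring(root, encoding="unicode")


name = output_name("suspicious-documentXYZ.txt", "source-documentABC.txt")
xml_text = build_detection_xml(
    "suspicious-documentXYZ.txt",
    "source-documentABC.txt",
    [(5, 1000, 100, 1000)],
)
```

The resulting string can then be written to a file called name in the output directory. Building the tree with ElementTree rather than string formatting ensures that attribute values are escaped and the file stays well-formed.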
Results
tba.
Related Work
- Plagiarism Detection, PAN @ CLEF'14
- Plagiarism Detection, PAN @ CLEF'13
- Plagiarism Detection, PAN @ CLEF'12
- Plagiarism Detection, PAN @ CLEF'11
- Plagiarism Detection, PAN @ CLEF'10
- Plagiarism Detection, PAN @ SEPLN'09