Generative Plagiarism Detection


Synopsis

Task: Given a pair of documents, your task is to identify all contiguous maximal-length passages of reused text between them.
Input: [data].
Baselines: [code].
Evaluation: [code].
Submission: Deployment on TIRA [submit]

Task Overview

To develop your software, we provide you with a training and validation corpus consisting of pairs of documents, one of which may contain passages of text reused from the other. The reused text has been automatically paraphrased by an LLM to hide the fact that it has been reused. Multiple LLMs were used, and the documents may also contain genuine LLM-paraphrased text that is not reused. The input and output formats are the same as in previous text-alignment tasks.

Data

The dataset is available via Zenodo. Please register first at TIRA. The dataset contains copyrighted material and may be used only for research purposes. Redistribution is not allowed.

The training and validation corpora each contain two folders: (1) the text data and (2) the annotation data (marked by the _truths postfix).

Text Data: contains a pairs file that lists all pairs of suspicious documents (in the susp folder) and source documents (in the src folder) to be compared.
Annotation Data: contains XML files for each pair in the pairs file, which give the locations of the reused text passages and their sources.
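
As a rough illustration of how the pairs file might be consumed, the sketch below reads it into a list of (suspicious, source) tuples. The layout it expects, one pair per line with the suspicious file name first and the source file name second, separated by whitespace, is an assumption and is not stated on this page; `parse_pairs` is a hypothetical helper name.

```python
def parse_pairs(pairs_text):
    """Turn the content of a pairs file into (suspicious, source) tuples.

    ASSUMPTION: one pair per line, suspicious document name first,
    source document name second, separated by whitespace.
    """
    pairs = []
    for line in pairs_text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


example = "suspicious-documentXYZ.txt source-documentABC.txt\n"
for susp_name, src_name in parse_pairs(example):
    print(susp_name, src_name)
```

If the actual file uses a different layout, only the splitting step needs to change.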

The annotation data contains the following information that should be used for training:

<document reference="suspicious-documentXYZ.txt">
    <feature
        name="plagiarism"
        this_offset="5"
        this_length="1000"
        source_reference="source-documentABC.txt"
        source_offset="100"
        source_length="1000"
        ...
    />
    <feature
        name="altered"
        this_offset="5"
        this_length="1000"
        source_reference="source-documentABC.txt"
        ...
    />
    ...
</document>

The plagiarism feature specifies an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt. The passage is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document. The other attributes allow for a more detailed analysis of the results and can be ignored for training.

The altered feature specifies the location of paraphrased text that was not reused (no plagiarism). This makes it possible to distinguish between genuine LLM-generated text and reused text. For the evaluation, only the plagiarism features need to be predicted.
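
The annotation format described above can be read with Python's standard `xml.etree.ElementTree` module. The following is a minimal sketch, not an official loader: `read_features` and `passage` are hypothetical helper names, and the slicing assumes that offsets count characters of the decoded text rather than bytes.

```python
import xml.etree.ElementTree as ET


def read_features(xml_text):
    """Split the <feature> elements of one annotation file by name."""
    root = ET.fromstring(xml_text)
    plagiarism, altered = [], []
    for feature in root.iter("feature"):
        name = feature.get("name")
        if name == "plagiarism":
            # Reused passage, aligned between suspicious and source document.
            plagiarism.append({
                "this_offset": int(feature.get("this_offset")),
                "this_length": int(feature.get("this_length")),
                "source_reference": feature.get("source_reference"),
                "source_offset": int(feature.get("source_offset")),
                "source_length": int(feature.get("source_length")),
            })
        elif name == "altered":
            # Paraphrased but not reused; only the suspicious-side span is read.
            altered.append({
                "this_offset": int(feature.get("this_offset")),
                "this_length": int(feature.get("this_length")),
            })
    return plagiarism, altered


def passage(text, offset, length):
    """Return the span of `length` characters starting at `offset`.

    ASSUMPTION: offsets are character offsets into the decoded text.
    """
    return text[offset:offset + length]
```

With the document text loaded as a string, `passage(susp_text, f["this_offset"], f["this_length"])` would give the suspicious-side passage for a plagiarism feature `f`.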

For each pair suspicious-documentXYZ.txt and source-documentABC.txt in the pairs file, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml that specifies the locations of the plagiarism cases detected within. Each feature should be named detected-plagiarism and specify the offsets and lengths in the suspicious and the source document. No other attributes are evaluated. For example:

<document reference="suspicious-documentXYZ.txt">
    <feature
      name="detected-plagiarism"
      this_offset="5"
      this_length="1000"
      source_reference="source-documentABC.txt"
      source_offset="100"
      source_length="1000"
    />
    <feature ... />
    ...
</document>

For evaluation, the offset and length attributes of the detected-plagiarism features will be compared against those of the plagiarism features in the annotation data. No other information will be evaluated.
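
The output format described above could be produced along the following lines. This is a sketch under stated assumptions rather than a reference implementation: `output_name` and `build_detection_xml` are hypothetical helpers, and each detection is assumed to be a tuple of (this_offset, this_length, source_offset, source_length).

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def output_name(susp_name, src_name):
    """Build the output file name from the two document names."""
    return f"{Path(susp_name).stem}-{Path(src_name).stem}.xml"


def build_detection_xml(susp_name, src_name, detections):
    """Serialize detections as detected-plagiarism features.

    ASSUMPTION: each detection is a tuple
    (this_offset, this_length, source_offset, source_length).
    """
    root = ET.Element("document", reference=susp_name)
    for this_offset, this_length, source_offset, source_length in detections:
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": src_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    return ET.tostring(root, encoding="unicode")


susp, src = "suspicious-documentXYZ.txt", "source-documentABC.txt"
print(output_name(susp, src))
print(build_detection_xml(susp, src, [(5, 1000, 100, 1000)]))
```

Writing the returned string to a file named by `output_name`, one file per pair, would match the naming scheme in the example above.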