Generative Plagiarism Detection 2026
Synopsis
- Task: Given a suspicious document and a collection of potential source documents, retrieve the source documents in the collection that the suspicious document plagiarizes (similar to the Source Retrieval task at PAN 2012 to 2015).
- Important dates:
- March 13, 2026: dataset release
- May 07, 2026: software or run submission [Tira]
- May 28, 2026: participant notebook submission [template] [submission – select "Stylometry and Digital Text Forensics (PAN)" ]
- Input: [spot-check dataset, test dataset].
- Baselines: [code].
- Evaluation: We will evaluate standard retrieval measures such as nDCG@10, Recall@10, and Recall@100.
- Submission: Software or Run Submission on TIRA [submit]
Task Overview
This task follows the classic retrieval task design. Each suspicious document (the query) is fully auto-generated by an undisclosed LLM that was instructed to write a new scientific document based on multiple (at least two) source documents. The goal of this task is to identify all sources for a given query document within a given collection. Please retrieve up to 1000 results from the collection of source documents, in the TREC-style run format, for each suspicious document.
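For illustration, a minimal lexical baseline could score each candidate source against the query with TF-IDF cosine similarity and return a ranked list. This is only a sketch, not the official baseline; the function name and tokenization are made up for the example:

```python
import math
from collections import Counter

def tfidf_rank(query, docs, k=10):
    """Rank docs (a dict mapping doc_id -> text) against a query
    by TF-IDF cosine similarity; return the top-k (doc_id, score) pairs."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    # document frequency of each term across the collection
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))

    def vec(tokens):
        tf = Counter(tokens)
        return {t: (1 + math.log(c)) * math.log(n / df[t])
                for t, c in tf.items() if t in df}

    q = vec(query.lower().split())
    scores = {}
    for d, toks in tokenized.items():
        v = vec(toks)
        num = sum(q.get(t, 0.0) * w for t, w in v.items())
        norm = (math.sqrt(sum(w * w for w in q.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        scores[d] = num / norm if norm else 0.0
    # highest similarity first
    return sorted(scores.items(), key=lambda x: -x[1])[:k]
```

A real submission would of course replace this with a stronger retrieval model, but the input/output shape (query text in, ranked doc ids out) is the same.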
Data
Please register first at Tira.
The dataset contains copyrighted material and may be used only for research purposes. No redistribution allowed.
We will prepare the data in a format compatible with ir_datasets so that you can directly load the data with the ir_datasets API (please look at the baseline for an example).
Accessing the data via ir_datasets works like:
from tira.third_party_integrations import ir_datasets
# spot-check-dataset-20260227-training is the ID of the spot-check dataset
dataset = ir_datasets.load("pan26-generated-plagiarism-detection/spot-check-dataset-20260227-training")
# access the to-be-retrieved documents via:
dataset.docs_iter()
# access the suspicious documents via:
dataset.queries_iter()
You can also access the data without ir_datasets. The data consists of two files: corpus.jsonl.gz contains all potential source documents (the documents to retrieve from), and queries.jsonl contains all queries (each query is a suspicious document for which the plagiarized source documents are to be retrieved).
- The corpus.jsonl.gz file has the following structure:
{"doc_id": "123", "default_text": "The text of the arxiv paper with ID 123"}
- The queries.jsonl file has the following structure:
{"qid": "1", "query": "The text of the document that we want to check for plagiarism."}
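Reading the raw files needs only the standard library. A minimal sketch (the function names are chosen for this example, not part of any provided API):

```python
import gzip
import json

def load_corpus(path="corpus.jsonl.gz"):
    """Yield (doc_id, text) pairs from the gzipped corpus file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            yield rec["doc_id"], rec["default_text"]

def load_queries(path="queries.jsonl"):
    """Yield (qid, query_text) pairs from the queries file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            yield rec["qid"], rec["query"]
```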
Output
For each suspicious document in the query set, retrieve up to 1000 documents and store the rankings in a run.txt.gz file that is in the TREC run format. Each line should look like:
qid Q0 doc rank score tag
With:
- qid: The query ID that identifies the suspicious document for which potential source documents should be retrieved.
- Q0: Unused, should always be Q0.
- doc: The document ID from the collection of source documents returned by your system for the qid.
- rank: The rank the document is retrieved at.
- score: The score (integer or floating point) that produced the ranking. Scores must be in descending (non-increasing) order, and tied scores need careful handling (trec_eval sorts documents by score, not by your rank values).
- tag: A tag that identifies your system
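Writing such a run file can be sketched as follows; the function name and argument layout are assumptions for this example, not part of the task API. Sorting by score before assigning ranks keeps the ranks consistent with how trec_eval orders documents:

```python
import gzip

def write_run(rankings, path="run.txt.gz", tag="my-system", depth=1000):
    """Write TREC-format run lines: qid Q0 doc rank score tag.
    rankings maps each qid to a list of (doc_id, score) pairs."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for qid, ranked in rankings.items():
            # sort by score descending so ranks agree with the scores
            ranked = sorted(ranked, key=lambda x: -x[1])[:depth]
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {doc_id} {rank} {score} {tag}\n")
```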
Results
tba.
Related Work
- Plagiarism Detection, PAN @ CLEF'25
- Plagiarism Detection, PAN @ CLEF'14
- Plagiarism Detection, PAN @ CLEF'13
- Plagiarism Detection, PAN @ CLEF'12
- Plagiarism Detection, PAN @ CLEF'11
- Plagiarism Detection, PAN @ CLEF'10
- Plagiarism Detection, PAN @ SEPLN'09