Voight-Kampff Generative AI Detection 2025

Synopsis
Task Overview
Subtask 1: Voight-Kampff AI Detection Sensitivity
Subtask 2: Human-AI Collaborative Text Classification
Related Work
Task Committee

Synopsis

Subtask 1: Given a (potentially obfuscated) text, decide whether it was written by a human or an AI.
Subtask 2: Given a document collaboratively authored by human and AI, classify the extent to which the model assisted.
Important dates:
- ~~May 23, 2025~~ May 26, 2025: software submission (extended due to server issues!)
- May 30, 2025: participant notebook submission [template] [submission – select "Stylometry and Digital Text Forensics (PAN)" ]
Data: Human and machine texts [download task 1] [download task 2]
Evaluation Measures: F1, C@1, AUC-ROC, FPR, FNR
Baselines: SVM, Compression, Binoculars [code task 1]; RoBERTa [code task 2]

Task Overview

The Generative AI Authorship Verification Task @ PAN is split into two subtasks [subtask 1, subtask 2]. Participants can submit their systems to either of them or both. Task 1 focuses on the robustness and sensitivity of detection systems, Task 2 focuses on the degree to which a mixed-authorship text is human- or machine-authored. The two tasks have individual datasets.

Submission of participant notebooks is done via EasyChair.

Subtask 1: Voight-Kampff AI Detection Sensitivity

Subtask 1 is a binary AI detection task in that participants are given a text and have to decide whether it was machine-authored (class 1) or human-authored (class 0). However, we introduced a twist: The LLMs were instructed to change their style and mimic a specific human author. Furthermore, the test set will contain several surprises such as new models or unknown obfuscations to test the robustness of the classifiers (however, texts will be from the same domain).

As in the previous year, the Voight-Kampff AI detection Task @ PAN is organized in collaboration with the Voight-Kampff Task @ ELOQUENT Lab Lab in a builder-breaker style. PAN participants will build systems to tell human and machine apart, while ELOQUENT participants will investigate novel text generation and obfuscation methods for avoiding detection.

Data

The dataset is available via Zenodo. Please register first at Tira and then request access on Zenodo using the same email address. The dataset contains copyrighted material and may be used only for research purposes. No redistribution allowed.

The training and validation dataset is provided as a set of newline-delimited JSON files. Each file contains a list of texts, written either by a human or a machine. The file format is as follows:

{"id": "a6c8018e-d22c-4d6e-b5e3-0c0a65682a6a", "text": "...", "model": "human", "label": 0, "genre": "essays"}
{"id": "f1a26761-ca2a-43e9-890d-80dcb3058364", "text": "...", "model": "gpt-4o", "label": 1, "genre": "essays"}
...

A "label" of 0 means human-written, 1 is ai-written. "genre" is for informational purposes only and can be either "essays", "news", or "fiction". Texts with "genre": "news" are sampled from last year's dataset (but with a few additions, such as GPT-4o). So if you want to reuse last year's dataset, be aware that some texts will be duplicates! The test dataset will have the same format, but with only the "id" and "text" columns.

Submission

Participants will submit their systems as Docker images through the Tira platform. It is not expected that submitted systems are actually trained on Tira, but they must be standalone and runnable on the platform without requiring contact to the outside world (evaluation runs will be sandboxed).

The submitted software must be executable inside the container via a command line call. The script must take two arguments: an input file (an absolute path to the input JSONL file) and an output directory (an absolute path to where the results will be written):

Within Tira, the input file will be called dataset.jsonl, so with the pre-defined Tira placeholders, your software should be invoked like this:

$ mySoftware $inputDataset/dataset.jsonl $outputDir

Within $outputDir, a single (!) file with the file extension *.jsonl must be created with the following format:

{"id": "bea8cccd-0c99-4977-9c1b-8423a9e1ed96", "label": 1.0}
{"id": "a963d7a0-d7e9-47c0-be84-a40ccc2005c7", "label": 0.2315}
...

For each test case in the input file, an output line must be written with the ID of the input text pair and a confidence score between 0.0 and 1.0. A score < 0.5 means that the text is believed to be human-authored. A score > 0.5 means that it is likely machine-written. A score of exactly 0.5 means the case is undecidable. Participants are encouraged to answer with 0.5 rather than making a wrong prediction. You can also give binary score (0 and 1) if your detector does not output class probabilities.

All test cases must be processed in isolation without information leakage between them! Even though systems may be given an input file with multiple JSON lines at once for reasons of efficiency, these inputs must be processed and answered just the same as if only a single line were given. Answers for any one test case must not depend on other cases in the input dataset!

Evaluation

Systems will be evaluated with the same measures as previous installments of the PAN authorship verification tasks. The following metrics will be used:

ROC-AUC: The area under the ROC (Receiver Operating Characteristic) curve.
Brier: The complement of the Brier score (mean squared loss).
C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of the remaining cases.
F₁: The harmonic mean of precision and recall.
F_0.5u: A modified F_0.5 measure (precision-weighted F measure) that treats non-answers (score = 0.5) as false negatives.
The arithmetic mean of all the metrics above.
A confusion matrix for calculating true/false positive/negative rates.

The evaluator for the task will output the above measures as JSON like so:

{
    "roc-auc": 0.996,
    "brier": 0.951,
    "c@1": 0.984,
    "f1": 0.98,
    "f05u": 0.981,
    "mean": 0.978,
    "confusion": [
        [
            1211,
            66
        ],
        [
            27,
            2285
        ]
    ]
}

Baselines

We provide three LLM detection baselines:

Linear SVM with TF-IDF features
Validation: [ROC-AUC: 0.996; Brier: 0.951; C@1: 0.984; F₁: 0.980; F_0.5u: 0.981; Mean: 0.978]
PPMd Compression-based Cosine [Sculley and Brodley, 2006] [Halvani et al., 2017]
Validation: [ROC-AUC: 0.786; Brier: 0.799; C@1: 0.757; F₁: 0.812; F_0.5u: 0.778; Mean: 0.786]
Binoculars [Hans et al., 2024]
Validation: [ROC-AUC: 0.918; Brier: 0.867; C@1: 0.844; F₁: 0.872; F_0.5u: 0.882; Mean: 0.877]

With TF-IDF SVM and PPMd CBC, we provide two bag-of-words authorship verification models. Binoculars uses large language models to measure text perplexity. The SVM classifier is a supervised LLM detector, the other two are unsupervised / zero-shot models. The baselines are published on GitHub. You can run them locally, in a Docker container, or using tira-run. All baselines come with a CLI and usage instructions. Their general usage is:

$ pan25-baseline BASELINENAME INPUT_FILE OUTPUT_DIRECTORY

Use --help on any subcommand for more information:

$ pan25-baseline --help
Usage: pan25-baseline [OPTIONS] COMMAND [ARGS]...

  PAN'25 Generative AI Authorship Verification baselines.

Options:
  --help  Show this message and exit.

Commands:
  binoculars  PAN'25 baseline: Binoculars.
  ppmd        PAN'25 baseline: Compression-based cosine.
  tfidf       PAN'25 baseline: TF-IDF SVM.

More information on how to install and run the baselines can be found in the README on GitHub.

Subtask 1 Leaderboard

Final leaderboard ranking for subtask 1. Teams who submitted multiple systems are listed only once with their best-performing system on the main test dataset.

#	Team	Software	ROC-AUC	Brier	C@1	F₁	F_0.5u	Mean	FPR	FNR
1	Macko	mdok	0.995	0.984	0.982	0.989	0.993	0.989	0.006	0.018
2	Valdez-Valenzuela	isg-graph-v3	0.939	0.902	0.897	0.926	0.960	0.929	0.020	0.107
3	Liu	modernbert	0.962	0.891	0.889	0.923	0.963	0.928	0.005	0.120
4	Seeliger	fine-roberta	0.912	0.898	0.896	0.930	0.959	0.925	0.082	0.103
5	Voznyuk	watery-bag	0.899	0.898	0.898	0.929	0.962	0.924	0.035	0.107
6	Yang	pink-condenser	0.930	0.893	0.886	0.920	0.960	0.923	0.018	0.122
7	Zaidi	sensitive-liason	0.931	0.891	0.887	0.924	0.958	0.922	0.062	0.115
8	hello-world	tart-objective	0.963	0.900	0.897	0.904	0.946	0.922	0.106	0.093
	Baseline TF-IDF SVM		0.963	0.900	0.897	0.904	0.946	0.922	0.106	0.093
9	bohan-li	distinct-dachshund	0.951	0.888	0.884	0.919	0.952	0.922	0.052	0.115
10	Marchitan	tangy-arch	0.945	0.890	0.869	0.905	0.952	0.916	0.011	0.142
11	Teja	tomato-conduction	0.897	0.881	0.881	0.916	0.958	0.914	0.005	0.129
12	xlbniu	poky-corgie	0.883	0.875	0.875	0.910	0.950	0.907	0.032	0.132
13	Jimeno-Gonzalez	sluggish-romano	0.941	0.873	0.849	0.892	0.943	0.901	0.029	0.162
14	Völpel	crabby-announcer	0.922	0.881	0.849	0.892	0.936	0.899	0.084	0.151
15	Ochab	big-cv	0.904	0.886	0.846	0.891	0.933	0.897	0.124	0.150
16	Ostrower	metallic-artillery	0.872	0.854	0.854	0.896	0.943	0.891	0.041	0.151
17	Basani	tangy-gorgonzola	0.904	0.864	0.843	0.894	0.943	0.891	0.084	0.160
18	Pudasaini	dense-casket	0.900	0.858	0.844	0.890	0.937	0.891	0.077	0.159
19	Sun	blistering-band	0.903	0.877	0.843	0.883	0.933	0.890	0.087	0.152
20	Larson	persistent-strut	0.830	0.863	0.863	0.910	0.935	0.885	0.234	0.116
21	Huang	connected-svn	0.864	0.827	0.848	0.906	0.869	0.870	1.000	0.000
22	Titze	undecidable-muenster	0.854	0.853	0.808	0.869	0.926	0.865	0.131	0.196
	Baseline Binoculars Llama3.1		0.827	0.856	0.818	0.866	0.907	0.863	0.263	0.173
23	Kumar	deafening-template	0.591	0.826	0.826	0.888	0.873	0.831	0.729	0.057
	Baseline PPMd CBC		0.644	0.802	0.759	0.817	0.839	0.790	0.797	0.137
24	Liang	chromatic-fruit	0.734	0.694	0.694	0.752	0.827	0.751	0.157	0.298

Subtask 2: Human-AI Collaborative Text Classification

In subtask 2, we focus on Human-AI Collaborative Text Classification, where the goal is to categorize documents that have been co-authored by humans and LLMs. Specifically, we aim to classify texts into six distinct categories based on the nature of human and machine contributions:

Fully human-written: The document is entirely authored by a human without any AI assistance.
Human-initiated, then machine-continued: A human starts writing, and an AI model completes the text.
Human-written, then machine-polished: The text is initially written by a human but later refined or edited by an AI model.
Machine-written, then machine-humanized (obfuscated): An AI generates the text, which is later modified to obscure its machine origin.
Machine-written, then human-edited: The content is generated by an AI but subsequently edited or refined by a human.
Deeply-mixed text: The document contains interwoven sections written by both humans and AI, without a clear separation.

Accurately distinguishing between these categories will enhance our understanding of human-AI collaboration and help mitigate the risks associated with synthetic text.

Data

The dataset for Task 2 can be downloaded from Zenodo. More information and baselines can be found in our GitHub repository.

Multi-domain documents (academic, journalism, social media)
Human-written and machine-generated samples (GPT-4, Claude, PaLM)
Collaborative texts with annotation layers for human/machine contributions
Multiple languages supported (English, Spanish, German)

Dataset label distribution:

Label Category	Train	Dev
Machine-written, then machine-humanized	91,232	10,137
Human-written, then machine-polished	95,398	12,289
Fully human-written	75,270	12,330
Human-initiated, then machine-continued	10,740	37,170
Deeply-mixed text (human + machine parts)	14,910	225
Machine-written, then human-edited	1,368	510
Total	288,918	72,661

Submission

Participants only need to submit the predicted labels by a jsonl file named asteam_name_submission_date.jsonl to CodaLab.

The team_name_submission_date.jsonl file should be a lines of objects by the following format:

{"id": "identifier of the test sample", "label": 1}
{"id": "identifier of the test sample", "label": 3}
...

labels are int from 0-5, that is, [0, 1, 2, 3, 4, 5]. Here is mapping.

id2label = {
0: “fully human-written”, 
1: “human-written, then machine-polished”, 
2: “machine-written, then machine-humanized”, 
3: “human-initiated, then machine-continued”,
4: “deeply-mixed text; where some parts are written by a human and some are generated by a machine”, 
5: “machine-written, then human-edited”
}

Check the correctness of your submission format by format_checker.py in our GitHub repository. This script checks whether the results format is correct. It also provides some warnings about possible errors.

Baselines

Zero-shot detectors: DetectGPT, Binoculars
Fine-tuned LLMs: RoBERTa-base, DeBERTa-v3
Ensemble methods with stylometric features

Code for the task baseline can be found in our GitHub repository.

Subtask 2 Leaderboard

Final leaderboard ranking for subtask 2. Teams ranked 17 to 21 submitted files with misaligned test set ids (marked with *). This misalignment led to an underestimation of their system performances. We present their corrected scores in light green in the updated ranking.

#	Team	Recall	F₁	Accuracy
1	mdok	64.46	65.06	74.09
2	Bohan Li	61.72	61.73	69.28
	lza*	60.61	61.43	70.15
3	Advacheck	60.16	60.85	69.04
4	StarBERT	57.46	56.31	66.81
5	Atu	56.87	56.45	66.30
6	TaoLi	56.74	55.39	66.27
7	ReText	56.11	55.25	64.79
	hkkk*	56.01	55.79	66.48
8	DetectTeam	54.49	54.40	62.89
	NanMu*	54.39	53.63	64.62
9	WeiDongWu	54.09	53.57	63.01
10	zhangzhiliang	54.06	52.81	61.65
11	CNLP-NITS-PP	54.05	53.49	62.23
12	a.dusuki	52.83	51.44	60.45
13	Steely	52.14	51.81	59.88
14	a.elnenaey	49.56	50.10	58.96
	Baseline	48.32	47.82	57.09
15	VerbaNex AI	47.15	47.15	56.24
	johanjthomas*	45.30	43.94	53.42
16	Unibuc-NLP	44.33	42.76	51.42
	Nexus Interrogators (late submission)	33.86	31.86	35.45
17	johanjthomas*	33.71	31.63	37.85
18	lza*	32.90	31.98	33.20
19	NanMu*	32.87	31.79	34.52
20	hkkk*	32.79	31.95	34.21
21	YoussefAhmed21*	16.48	14.98	21.22

Janek Bevendorff, Matti Wiegmann, Jussi Karlgren, Luise Dürlich, Evangelia Gogoulou, Aarne Talman, Efstathios Stamatatos, Martin Potthast, and Benno Stein. Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2024. In Guglielmo Faggioli, Nicola Ferro, Petra Galuščáková, and Alba García Seco de Herrera, editors, Working Notes of CLEF 2024 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, pages 2486-2506, September 2024. CEUR-WS.org.
Bevendorff, Janek, Xavier Bonet Casals, Berta Chulvi, Daryna Dementieva, Ashaf Elnagar, Dayne Freitag, Maik Fröbe, et al. 2024. Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification: Extended Abstract. In Lecture Notes in Computer Science, 3-10. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland.
Uchendu, Adaku, Thai Le, Kai Shu, and Dongwon Lee. 2020. Authorship Attribution for Neural Text Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8384-95. Online: Association for Computational Linguistics.
Jakesch, Maurice, Jeffrey T. Hancock, and Mor Naaman. 2023. Human Heuristics for AI-Generated Language Are Flawed. Proceedings of the National Academy of Sciences of the United States of America 120 (11): e2208839120.
Hans, Abhimanyu, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024. Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text. arXiv [Cs.CL].
Su, Jinyan, Terry Yue Zhuo, Di Wang, and Preslav Nakov. 2023. DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text. arXiv [Cs.CL].
Mitchell, Eric, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature. arXiv [Cs.CL].
Bao, Guangsheng, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. 2023. Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. arXiv [Cs.CL].
Koppel, Moshe, and Jonathan Schler. 2004. Authorship Verification as a One-Class Classification Problem. In Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004, 489-95.
Bevendorff, Janek, Benno Stein, Matthias Hagen, and Martin Potthast. 2019. Generalizing Unmasking for Short Texts. In Proceedings of the 2019 Conference of the North, 654-59. Stroudsburg, PA, USA: Association for Computational Linguistics.
Sculley, D., and C. E. Brodley. 2006. Compression and Machine Learning: A New Perspective on Feature Space Vectors. In Data Compression Conference (DCC'06), 332-41. IEEE.
Halvani, Oren, Christian Winter, and Lukas Graner. 2017. On the Usefulness of Compression Models for Authorship Verification. In ACM International Conference Proceeding Series. Vol. Part F1305. Association for Computing Machinery. https://doi.org/10.1145/3098954.3104050.
Uchendu, Adaku, Zeyu Ma, Thai Le, Rui Zhang, and Dongwon Lee. 2021. TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2001-16. Stroudsburg, PA, USA: Association for Computational Linguistics.
Schuster, Tal, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020. The Limitations of Stylometry for Detecting Machine-Generated Fake News. Computational Linguistics 46 (2): 499-510.
Sadasivan, Vinu Sankar, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023. Can AI-Generated Text Be Reliably Detected? arXiv [Cs.CL].
Ippolito, Daphne, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic Detection of Generated Text Is Easiest When Humans Are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1808-22. Stroudsburg, PA, USA: Association for Computational Linguistics.