Voight-Kampff Generative AI Authorship Verification 2024
Synopsis
- Task: Given two texts, one authored by a human, one by a machine: pick out the human.
- Important dates: 13 May 2024 (system submission; extended from 6 May), 31 May 2024 (notebook paper submission) [CLEF'24 lab schedule]
- Submission: Deployment on TIRA [submit]
- Input: Pairs of texts, one of which was written by a human. [download]
- Evaluation Measures: ROC-AUC, Brier, C@1, F1, F0.5u [code]
- Baselines: PPMd CBC, Authorship Unmasking, Binoculars, DetectLLM, DetectGPT, Fast-DetectGPT, Text length [code]
- Evaluation Code: [GitHub]
- Task Overview Paper: [bib]
Task
With Large Language Models (LLMs) improving at breakneck speed and seeing more widespread adoption every day, it is getting increasingly hard to discern whether a given text was authored by a human being or a machine. Many classification approaches have been devised to help humans distinguish between human- and machine-authored text, though often without questioning the fundamental and inherent feasibility of the task itself.
With years of experience in the related but much broader field of authorship verification, we set out to answer whether this task can be solved at all. We start with the simplest suitable task setup: given two texts, one authored by a human and one by a machine, pick out the human.
The Generative AI Authorship Verification Task @ PAN is organized in collaboration with the Voight-Kampff Task @ ELOQUENT Lab in a builder-breaker style. PAN participants will build systems to tell human and machine apart, while ELOQUENT participants will investigate novel text generation and obfuscation methods for avoiding detection.
Data
Test data for this task will be compiled from the submissions of ELOQUENT participants and will comprise multiple text genres, such as news articles, Wikipedia intro texts, and fanfiction. Additionally, PAN participants will be provided with a bootstrap dataset of real and fake news articles covering a range of 2021 U.S. news headlines. The bootstrap dataset can be used for training purposes, though we strongly recommend using other sources as well.
Download instructions: The dataset is available via Zenodo. Please register first at TIRA and then request access on Zenodo using the same email address. The dataset contains copyrighted material and may be used only for research purposes. Redistribution is not allowed.
The bootstrap dataset is provided as a set of newline-delimited JSON (JSONL) files. Each file contains a list of articles, written either by (any number of) human authors or by a single machine. That is, the file human.jsonl contains only human texts, whereas a file such as gemini-pro.jsonl contains articles about the same topics, but written entirely by Google's Gemini Pro. The file format is as follows:
{"id": "gemini-pro/news-2021-01-01-2021-12-31-kabulairportattack/art-081", "text": "..."}
{"id": "gemini-pro/news-2021-01-01-2021-12-31-capitolriot/art-050", "text": "..."}
...
The article IDs and line orderings are the same across all files (except for the model-specific prefix before the first /), so the same line always corresponds to the same topic, but from different “authors”.
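As a quick illustration of how this line alignment can be used, here is a minimal loading sketch (assuming the files have been downloaded from Zenodo into the working directory; read_jsonl and the pairing logic are our own helpers, not part of the official tooling):

import json

def read_jsonl(path):
    # Read one bootstrap file into a list of {"id": ..., "text": ...} records.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

human = read_jsonl("human.jsonl")
machine = read_jsonl("gemini-pro.jsonl")

# Because line orderings are aligned across files, zipping yields human/machine
# article pairs about the same topic, which can be used to build training pairs.
pairs = []
for h, m in zip(human, machine):
    assert h["id"].split("/", 1)[1] == m["id"].split("/", 1)[1]  # same topic ID
    pairs.append((h["text"], m["text"]))
print(len(pairs), "aligned human/machine pairs")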
The test dataset will be provided in a different format. Instead of individual files, only a single JSONL file will be given, each line containing a pair of texts:
{"id": "iixcWBmKWQqLAwVXxXGBGg", "text1": "...", "text2": "..."}
{"id": "y12zUebGVHSN9yiL8oRZ8Q", "text1": "...", "text2": "..."}
...
The IDs will be scrambled, and the participants' task is to generate an output file with a prediction for each pair indicating which of the two texts is the human one (see Submission below).
Evaluation
Systems will be evaluated with the same measures as previous installments of the PAN authorship verification tasks. The following metrics will be used:
- ROC-AUC: The area under the ROC (Receiver Operating Characteristic) curve.
- Brier: The complement of the Brier score (mean squared loss).
- C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of the remaining cases.
- F1: The harmonic mean of precision and recall.
- F0.5u: A modified F0.5 measure (precision-weighted F measure) that treats non-answers (score = 0.5) as false negatives.
- Mean: The arithmetic mean of all the metrics above.
The evaluator for the task will output the above measures as JSON like so:
{
"roc-auc": 0.992,
"brier": 0.979,
"c@1": 0.978,
"f1": 0.978,
"f05u": 0.978,
"mean": 0.981
}
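For orientation, the sketch below restates C@1, the Brier score complement, and F0.5u in code, following the definitions above. It is an unofficial approximation assuming binary ground-truth labels aligned with the score convention; the evaluator linked on GitHub is authoritative, and the helper names here are our own:

def brier_complement(truth, scores):
    # 1 minus the mean squared error between scores in [0, 1] and binary truth.
    return 1.0 - sum((s - t) ** 2 for t, s in zip(truth, scores)) / len(truth)

def c_at_1(truth, scores, tau=0.5):
    # Accuracy in which non-answers (score == tau) are credited with the
    # average accuracy of the answered cases.
    n = len(truth)
    n_correct = sum(1 for t, s in zip(truth, scores)
                    if s != tau and (s > tau) == (t == 1))
    n_unanswered = sum(1 for s in scores if s == tau)
    return (n_correct + n_unanswered * n_correct / n) / n

def f05_u(truth, scores, tau=0.5):
    # Precision-weighted F0.5 that counts non-answers (score == tau) as false negatives.
    tp = sum(1 for t, s in zip(truth, scores) if t == 1 and s > tau)
    fp = sum(1 for t, s in zip(truth, scores) if t == 0 and s > tau)
    fn = sum(1 for t, s in zip(truth, scores) if t == 1 and s < tau)
    n_u = sum(1 for s in scores if s == tau)
    return 1.25 * tp / (1.25 * tp + 0.25 * (fn + n_u) + fp)

ROC-AUC can be taken from a standard library such as scikit-learn (sklearn.metrics.roc_auc_score), and the reported mean is simply the arithmetic mean of the five measures.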
Submission
Participants will submit their systems as Docker images through the TIRA platform. Submitted systems are not expected to be trained on TIRA, but they must be self-contained and runnable on the platform without contacting the outside world (evaluation runs will be sandboxed).
The submitted software must be executable inside the container via a command line call. The script must take two arguments: an input file (the absolute path to the input JSONL file) and an output directory (the absolute path to which the results will be written).
Within TIRA, the input file will be called dataset.jsonl, so with the pre-defined TIRA placeholders, your software should be invoked like this:
$ mySoftware $inputDataset/dataset.jsonl $outputDir
Within $outputDir, a single (!) file with the file extension *.jsonl must be created with the following format:
{"id": "iixcWBmKWQqLAwVXxXGBGg", "is_human": 1.0}
{"id": "y12zUebGVHSN9yiL8oRZ8Q", "is_human": 0.3}
...
For each test case in the input file, an output line must be written containing the ID of the input text pair and a confidence score between 0.0 and 1.0. A score < 0.5 means that text1 is believed to be human-authored. A score > 0.5 means that text2 is believed to be human-authored. A score of exactly 0.5 means the case is undecidable. Participants are encouraged to answer with 0.5 rather than making a wrong prediction.
All test cases must be processed in isolation without information leakage between them! Even though systems may be given an input file with multiple JSON lines at once for reasons of efficiency, these inputs must be processed and answered just the same as if only a single line were given. Answers for any one test case must not depend on other cases in the input dataset!
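To make the expected input/output handling concrete, here is a minimal, hypothetical run script following the format above. The function score_pair is a placeholder where an actual detector would go, and answers.jsonl is one arbitrary choice of output file name matching *.jsonl:

import json
import sys

def score_pair(text1, text2):
    # Placeholder detector returning a value in [0, 1]:
    # < 0.5 -> text1 believed human, > 0.5 -> text2 believed human,
    # exactly 0.5 -> undecidable. Replace with a real model.
    return 0.5

def main():
    input_file, output_dir = sys.argv[1], sys.argv[2]
    with open(input_file, encoding="utf-8") as fin, \
         open(f"{output_dir}/answers.jsonl", "w", encoding="utf-8") as fout:
        for line in fin:
            case = json.loads(line)
            # Each case is scored in isolation; nothing is shared between cases.
            score = score_pair(case["text1"], case["text2"])
            fout.write(json.dumps({"id": case["id"], "is_human": score}) + "\n")

if __name__ == "__main__":
    main()

Wrapped in a Docker image, such a script would be invoked exactly like the mySoftware example above, with $inputDataset/dataset.jsonl and $outputDir as its two arguments.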
Tip: You can test your submission using the pan24-generative-authorship-smoke-test dataset and evaluator (which is different from the final test dataset).
Baselines
We provide seven baselines (six LLM detection approaches plus a trivial text-length sanity check), re-implemented from the original papers:
- PPMd Compression-based Cosine [Sculley and Brodley, 2006] [Halvani et al., 2017]
- Authorship Unmasking [Koppel and Schler, 2004] [Bevendorff et al., 2019]
- Binoculars [Hans et al., 2024]
- DetectLLM LRR and NPR [Su et al., 2023]
- DetectGPT [Mitchell et al., 2023]
- Fast-DetectGPT [Bao et al., 2023]
- Text length
With PPMd CBC and Authorship Unmasking, we provide two bag-of-words authorship verification models. Binoculars, DetectLLM, DetectGPT, and Fast-DetectGPT use large language models to measure text perplexity. The text length baseline serves mainly as a data sanity check and is designed to have random performance.
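For a flavour of the simplest baseline family, below is an unofficial sketch of a compression-based cosine in the spirit of Sculley and Brodley (2006). It substitutes zlib for the PPMd compressor used by the actual baseline, and how the resulting similarity between the two texts of a pair is turned into an is_human score is a separate design decision of the real implementation:

import zlib

def c(data: bytes) -> int:
    # Compressed size in bytes (zlib here; the official baseline uses PPMd).
    return len(zlib.compress(data, 9))

def compression_cosine(x: str, y: str) -> float:
    # Compression-based cosine similarity (Sculley and Brodley, 2006):
    # (C(x) + C(y) - C(xy)) / sqrt(C(x) * C(y)); higher means more similar.
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy, cxy = c(bx), c(by), c(bx + by)
    return (cx + cy - cxy) / (cx * cy) ** 0.5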
The baselines are published on GitHub. You can run them locally, in a Docker container, or using tira-run. All baselines come with a CLI and usage instructions. Their general usage is:
$ baseline BASELINENAME [OPTIONS] INPUT_FILE OUTPUT_DIRECTORY
Use --help on any subcommand for more information:
$ baseline --help
Usage: baseline [OPTIONS] COMMAND [ARGS]...

  PAN'24 Generative Authorship Detection baselines.

Options:
  --help  Show this message and exit.

Commands:
  binoculars     PAN'24 baseline: Binoculars.
  detectgpt      PAN'24 baseline: DetectGPT.
  detectllm      PAN'24 baseline: DetectLLM.
  fastdetectgpt  PAN'24 baseline: Fast-DetectGPT.
  length         PAN'24 baseline: Text length.
  ppmd           PAN'24 baseline: Compression-based cosine.
  unmasking      PAN'24 baseline: Authorship unmasking.
More information on how to install and run the baselines can be found in the README on GitHub.
Results
The following are the final scores of all teams participating in the 2024 shared task. The individual effectiveness scores are aggregated across all test datasets and corrected by half a standard deviation to penalize unstable classification performance. The ranking is based on the mean of all individual scores (last column).
If teams submitted multiple systems, only the system performing best on the main test dataset is included in the ranking. Scores marked with * are only approximate, since the system failed to run on one or more datasets and the missing scores were filled with the mean score of all other systems.
| # | Team | System | ROC-AUC | Brier | C@1 | F1 | F0.5u | Mean |
|---|------|--------|---------|-------|-----|----|-------|------|
| 1 | marsan | staff-trunk | 0.961 | 0.928 | 0.912 | 0.884 | 0.932 | 0.924 |
| 2 | you-shun-you-de | charitable-mole_v3 | 0.931 | 0.926 | 0.928 | 0.905 | 0.913 | 0.921 |
| 3 | baselineavengers | svm | 0.925 | 0.869 | 0.882 | 0.875 | 0.869 | 0.886 |
| 4 | g-fosunlpteam | gritty-producer | 0.889 | 0.875 | 0.887 | 0.884 | 0.884 | 0.884 |
| 5 | lam | blistering-moss | 0.851 | 0.850 | 0.850 | 0.852 | 0.849 | 0.851 |
| 6 | drocks | muffled-stock | 0.866 | 0.863 | 0.834 | 0.825 | 0.820 | 0.843 |
| 7 | aida | corporate-burn | 0.831 | 0.825 | 0.795 | 0.788 | 0.782 | 0.806 |
| 8 | cnlp-nits-pp | direct-velocity | 0.844 | 0.793 | 0.805 | 0.789 | 0.792 | 0.806 |
| 9 | fosu-stu | proud-stick | 0.833 | 0.867 | 0.799 | 0.748 | 0.767 | 0.804 |
| 10 | ap-team | marinated-pantone | 0.853 | 0.862 | 0.795 | 0.718 | 0.742 | 0.796 |
| 11 | heartsteel | canary-paint | 0.777 | 0.777 | 0.777 | 0.780 | 0.777 | 0.778 |
| 12 | verification-team | merciless-lease | 0.799 | 0.788 | 0.740 | 0.740 | 0.741 | 0.763 |
| | Baseline | Binoculars (Falcon-7B) | 0.751 | 0.780 | 0.734 | 0.720 | 0.720 | 0.741 |
| 13 | huangbaijian | bitter-metaphor | 0.756* | 0.782* | 0.726* | 0.706* | 0.703* | 0.735* |
| 14 | iimasnlp | final-run4-gnnllm_llmft_stylofeat-partitionB | 0.741* | 0.760* | 0.718* | 0.711* | 0.695* | 0.727* |
| 15 | no-999 | method2 | 0.901 | 0.758 | 0.733 | 0.549 | 0.653 | 0.722 |
| 16 | j1j | product-dust | 0.692 | 0.678 | 0.678 | 0.732 | 0.680 | 0.694 |
| 17 | jaha | greasy-chest | 0.736 | 0.731 | 0.731 | 0.590 | 0.614 | 0.683 |
| 18 | qinnnruiii | tender-couple | 0.689* | 0.730* | 0.672* | 0.652* | 0.652* | 0.680* |
| | Baseline | Binoculars (Mistral-7B) | 0.676 | 0.711 | 0.663 | 0.654 | 0.648 | 0.671 |
| | Baseline | DetectLLM-LRR (Mistral-7B) | 0.656 | 0.758 | 0.617 | 0.618 | 0.618 | 0.654 |
| 19 | petropoulossiblings | extended-parakeet | 0.594 | 0.694 | 0.670 | 0.631 | 0.590 | 0.641 |
| | Baseline | Fast-DetectGPT (Mistral-7B) | 0.637 | 0.710 | 0.616 | 0.611 | 0.608 | 0.638 |
| 20 | turtlewu | 0.981_smoke_transofrmer | 0.645 | 0.649 | 0.587 | 0.578 | 0.577 | 0.608 |
| | Baseline | Text Length | 0.608 | 0.607 | 0.607 | 0.596 | 0.596 | 0.604 |
| 21 | gra | ash-causeway | 0.500 | 0.750 | 0.467 | 0.634 | 0.521 | 0.574 |
| 22 | you-na-you-de | lzj_v2 | 0.593 | 0.598 | 0.598 | 0.458 | 0.565 | 0.565 |
| 23 | team-chenteam-chen123hhh | beige-limit | 0.627 | 0.660 | 0.590 | 0.442 | 0.433 | 0.555 |
| | Baseline | PPMd CBC | 0.555 | 0.622 | 0.523 | 0.508 | 0.507 | 0.544 |
| 24 | ai-team | ash-ricotta | 0.525 | 0.622 | 0.506 | 0.499 | 0.498 | 0.531 |
| | Baseline | DetectLLM-NPR (Mistral-7B) | 0.497 | 0.602 | 0.494 | 0.481 | 0.481 | 0.512 |
| 25 | younanyousha | brownian-architect | 0.598 | 0.604 | 0.604 | 0.318 | 0.378 | 0.504 |
| | Baseline | Fast-DetectGPT (Falcon-7B) | 0.480 | 0.626 | 0.474 | 0.457 | 0.458 | 0.500 |
| 26 | foshan-university-of-guangdong | adjacent-rate | 0.464 | 0.660 | 0.462 | 0.448 | 0.448 | 0.497 |
| 27 | e-comm-tech | great-plan | 0.463 | 0.651 | 0.467 | 0.445 | 0.446 | 0.497 |
| | Baseline | DetectGPT (Mistral-7B) | 0.472 | 0.552 | 0.476 | 0.468 | 0.465 | 0.488 |
| 28 | yomiya | current-boutique | 0.645 | 0.798 | 0.325 | 0.307 | 0.323 | 0.480 |
| | Baseline | DetectLLM-NPR (Falcon-7B) | 0.445 | 0.575 | 0.449 | 0.432 | 0.433 | 0.468 |
| | Baseline | Authorship Unmasking | 0.586 | 0.749 | 0.337 | 0.323 | 0.328 | 0.467 |
| 29 | karami-kheiri | bare-broker | 0.627 | 0.789 | 0.304 | 0.282 | 0.296 | 0.460 |
| | Baseline | DetectLLM-LRR (Falcon-7B) | 0.441 | 0.600 | 0.428 | 0.413 | 0.413 | 0.460 |
| 30 | lm-detector | detector | 0.493 | 0.586 | 0.409 | 0.366 | 0.382 | 0.450 |
| | Baseline | DetectGPT (Falcon-7B) | 0.409 | 0.526 | 0.425 | 0.413 | 0.412 | 0.439 |
Citation
To cite the Voight-Kampff Generative AI Authorship Verification Task @ PAN'24, please use the following BibTeX entry:
@InProceedings{bevendorff:2024,
author = {Janek Bevendorff and Matti Wiegmann and Jussi Karlgren and Luise D{\"u}rlich and Evangelia Gogoulou and Aarne Talman and Efstathios Stamatatos and Martin Potthast and Benno Stein},
booktitle = {Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
editor = {Guglielmo Faggioli and Nicola Ferro and Petra Galu\v{s}\v{c}{\'a}kov{\'a} and Alba Garc{\'\i}a Seco de Herrera},
month = sep,
publisher = {CEUR-WS.org},
series = {CEUR Workshop Proceedings},
site = {Grenoble, France},
title = {{Overview of the ``Voight-Kampff'' Generative AI Authorship Verification Task at PAN and ELOQUENT 2024}},
year = 2024
}
Related Work
- Bevendorff, Janek, Xavier Bonet Casals, Berta Chulvi, Daryna Dementieva, Ashaf Elnagar, Dayne Freitag, Maik Fröbe, et al. 2024. Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification: Extended Abstract. In Lecture Notes in Computer Science, 3–10. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland.
- Uchendu, Adaku, Thai Le, Kai Shu, and Dongwon Lee. 2020. Authorship Attribution for Neural Text Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8384–95. Online: Association for Computational Linguistics.
- Jakesch, Maurice, Jeffrey T. Hancock, and Mor Naaman. 2023. Human Heuristics for AI-Generated Language Are Flawed. Proceedings of the National Academy of Sciences of the United States of America 120 (11): e2208839120.
- Hans, Abhimanyu, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024. Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text. arXiv [Cs.CL].
- Su, Jinyan, Terry Yue Zhuo, Di Wang, and Preslav Nakov. 2023. DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text. arXiv [Cs.CL].
- Mitchell, Eric, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature. arXiv [Cs.CL].
- Bao, Guangsheng, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. 2023. Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. arXiv [Cs.CL].
- Koppel, Moshe, and Jonathan Schler. 2004. Authorship Verification as a One-Class Classification Problem. In Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004, 489–95.
- Bevendorff, Janek, Benno Stein, Matthias Hagen, and Martin Potthast. 2019. Generalizing Unmasking for Short Texts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 654–59. Stroudsburg, PA, USA: Association for Computational Linguistics.
- Sculley, D., and C. E. Brodley. 2006. Compression and Machine Learning: A New Perspective on Feature Space Vectors. In Data Compression Conference (DCC’06), 332–41. IEEE.
- Halvani, Oren, Christian Winter, and Lukas Graner. 2017. On the Usefulness of Compression Models for Authorship Verification. In ACM International Conference Proceeding Series. Vol. Part F1305. Association for Computing Machinery. https://doi.org/10.1145/3098954.3104050.
- Uchendu, Adaku, Zeyu Ma, Thai Le, Rui Zhang, and Dongwon Lee. 2021. TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2001–16. Stroudsburg, PA, USA: Association for Computational Linguistics.
- Schuster, Tal, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020. The Limitations of Stylometry for Detecting Machine-Generated Fake News. Computational Linguistics 46 (2): 499–510.
- Sadasivan, Vinu Sankar, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023. Can AI-Generated Text Be Reliably Detected? arXiv [Cs.CL].
- Ippolito, Daphne, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic Detection of Generated Text Is Easiest When Humans Are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1808–22. Stroudsburg, PA, USA: Association for Computational Linguistics.