Synopsis

  • Task: Given a (potentially obfuscated) text, decide whether it was written by a human or an AI.
  • Registration: [CLEF labs] [Tira]
  • Important dates:
    • May 07, 2026: software submission
    • May 28, 2026: participant notebook submission [template] [submission – select "Stylometry and Digital Text Forensics (PAN)"]
  • Data: Human and machine texts [download]
  • Evaluation Measures: ROC-AUC, Brier, C@1, F1, F0.5u (plus FPR/FNR from the confusion matrix)
  • Baselines: SVM, Compression, Binoculars [code]

Task Overview

The Voight-Kampff AI Detection task is a binary classification task: participants are given a text and have to decide whether it was machine-authored (class 1) or human-authored (class 0). However, we introduced a twist: the LLMs were instructed to change their style and mimic a specific human author. Furthermore, the test set will contain several surprises, such as new models or unknown obfuscations, to test the robustness of the classifiers (the texts will, however, come from the same domain).

As in the previous year, the Voight-Kampff AI Detection Task @ PAN is organized in collaboration with the Voight-Kampff Task @ ELOQUENT Lab in a builder-breaker style. PAN participants will build systems to tell human and machine text apart, while ELOQUENT participants will investigate novel text generation and obfuscation methods for evading detection.

Data

The dataset is available via Zenodo. Please register first at Tira and then request access on Zenodo using the same email address. The dataset contains copyrighted material and may be used only for research purposes. No redistribution allowed.

The training and validation dataset is provided as a set of newline-delimited JSON files. Each file contains a list of texts, written either by a human or a machine. The file format is as follows:

{"id": "a6c8018e-d22c-4d6e-b5e3-0c0a65682a6a", "text": "...", "model": "human", "label": 0, "genre": "essays"}
{"id": "f1a26761-ca2a-43e9-890d-80dcb3058364", "text": "...", "model": "gpt-4o", "label": 1, "genre": "essays"}
...

A "label" of 0 means human-written, a "label" of 1 means AI-written. The "genre" field is for informational purposes only and can be either "essays", "news", or "fiction". Texts with "genre": "news" are sampled from last year's dataset (with a few additions, such as GPT-4o). So if you want to reuse last year's dataset, be aware that some texts will be duplicates!

The test dataset will have the same format, but with only the "id" and "text" fields.
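
For reference, here is a minimal sketch (not part of the official tooling) of how the training data could be loaded in Python; the file name train.jsonl is only a placeholder:

import json

def load_jsonl(path):
    """Read a newline-delimited JSON file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# "train.jsonl" is a placeholder name, not part of the official release.
train = load_jsonl("train.jsonl")
texts = [record["text"] for record in train]
labels = [record["label"] for record in train]  # 0 = human, 1 = AI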

Submission

Participants will submit their systems as Docker images through the Tira platform. Submitted systems are not expected to be trained on Tira, but they must be standalone and runnable on the platform without access to the outside world (evaluation runs will be sandboxed).

The submitted software must be executable inside the container via a command-line call. It must take two arguments: an input file (the absolute path to the input JSONL file) and an output directory (the absolute path to where the results will be written).

Within Tira, the input file will be called dataset.jsonl, so with the pre-defined Tira placeholders, your software should be invoked like this:

$ mySoftware $inputDataset/dataset.jsonl $outputDir

Within $outputDir, a single (!) file with the file extension *.jsonl must be created with the following format:

{"id": "bea8cccd-0c99-4977-9c1b-8423a9e1ed96", "label": 1.0}
{"id": "a963d7a0-d7e9-47c0-be84-a40ccc2005c7", "label": 0.2315}
...

For each test case in the input file, an output line must be written with the ID of the input text and a confidence score between 0.0 and 1.0. A score < 0.5 means that the text is believed to be human-authored, a score > 0.5 means that it is likely machine-written, and a score of exactly 0.5 means the case is undecidable. Participants are encouraged to answer with 0.5 rather than make a wrong prediction. You can also give binary scores (0 or 1) if your detector does not output class probabilities.

All test cases must be processed in isolation without information leakage between them! Even though systems may be given an input file with multiple JSON lines at once for reasons of efficiency, these inputs must be processed and answered just the same as if only a single line were given. Answers for any one test case must not depend on other cases in the input dataset!
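
To illustrate the expected I/O behavior (a skeleton only, not a required implementation), a minimal submission script in Python might look as follows; the constant 0.5 and the output file name predictions.jsonl are placeholders:

#!/usr/bin/env python3
"""Minimal submission skeleton (illustrative only)."""
import json
import sys
from pathlib import Path

def predict(text):
    # Placeholder: replace with your detector.
    # Must return a confidence in [0.0, 1.0], where 1.0 = machine-authored.
    return 0.5

def main():
    input_file = Path(sys.argv[1])   # e.g. $inputDataset/dataset.jsonl
    output_dir = Path(sys.argv[2])   # e.g. $outputDir
    output_dir.mkdir(parents=True, exist_ok=True)

    # A single *.jsonl file must be written; "predictions.jsonl" is an example name.
    with open(input_file, encoding="utf-8") as fin, \
         open(output_dir / "predictions.jsonl", "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue
            case = json.loads(line)
            # Each case is answered independently of all other cases.
            fout.write(json.dumps({"id": case["id"], "label": predict(case["text"])}) + "\n")

if __name__ == "__main__":
    main()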

Evaluation

Systems will be evaluated with the same measures as in previous installments of the PAN authorship verification tasks. The following metrics will be used:

  • ROC-AUC: The area under the ROC (Receiver Operating Characteristic) curve.
  • Brier: The complement of the Brier score (mean squared loss).
  • C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of the remaining cases.
  • F1: The harmonic mean of precision and recall.
  • F0.5u: A modified F0.5 measure (precision-weighted F measure) that treats non-answers (score = 0.5) as false negatives.
  • The arithmetic mean of all the metrics above.
  • A confusion matrix for calculating true/false positive/negative rates.

The evaluator for the task will output the above measures as JSON like so:

{
    "roc-auc": 0.996,
    "brier": 0.951,
    "c@1": 0.984,
    "f1": 0.98,
    "f05u": 0.981,
    "mean": 0.978,
    "confusion": [
        [
            1211,
            66
        ],
        [
            27,
            2285
        ]
    ]
}
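
For intuition, the two less common measures, C@1 and F0.5u, can be computed from a list of true labels and confidence scores roughly as follows. This is an illustrative sketch following the definitions above, not the official evaluator:

def c_at_1(true_labels, scores):
    """C@1: non-answers (score == 0.5) are credited with the accuracy
    achieved on the answered cases."""
    n = len(true_labels)
    correct = sum(1 for t, s in zip(true_labels, scores)
                  if s != 0.5 and (s > 0.5) == (t == 1))
    unanswered = sum(1 for s in scores if s == 0.5)
    return (correct + unanswered * correct / n) / n

def f_05_u(true_labels, scores):
    """F0.5u: precision-weighted F measure with non-answers (score == 0.5)
    counted as false negatives."""
    tp = sum(1 for t, s in zip(true_labels, scores) if s > 0.5 and t == 1)
    fp = sum(1 for t, s in zip(true_labels, scores) if s > 0.5 and t == 0)
    fn = sum(1 for t, s in zip(true_labels, scores) if s < 0.5 and t == 1)
    nu = sum(1 for s in scores if s == 0.5)  # non-answers
    return 1.25 * tp / (1.25 * tp + 0.25 * (fn + nu) + fp)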

Baselines

We provide three LLM detection baselines:

  • Linear SVM with TF-IDF features
    Validation: [ROC-AUC: 0.996; Brier: 0.951; C@1: 0.984; F1: 0.980; F0.5u: 0.981; Mean: 0.978]
  • PPMd Compression-based Cosine [Sculley and Brodley, 2006] [Halvani et al., 2017]
    Validation: [ROC-AUC: 0.786; Brier: 0.799; C@1: 0.757; F1: 0.812; F0.5u: 0.778; Mean: 0.786]
  • Binoculars [Hans et al., 2024]
    Validation: [ROC-AUC: 0.918; Brier: 0.867; C@1: 0.844; F1: 0.872; F0.5u: 0.882; Mean: 0.877]

With TF-IDF SVM and PPMd CBC, we provide two classic, non-neural authorship verification baselines; Binoculars uses large language models to measure text perplexity. The SVM classifier is a supervised LLM detector, while the other two are unsupervised / zero-shot models. The baselines are published on GitHub. You can run them locally, in a Docker container, or via tira-run. All baselines come with a CLI and usage instructions. Their general usage is:

$ pan25-baseline BASELINENAME INPUT_FILE OUTPUT_DIRECTORY
Use --help on any subcommand for more information:
$ pan25-baseline --help
Usage: pan25-baseline [OPTIONS] COMMAND [ARGS]...

  PAN'25 Generative AI Authorship Verification baselines.

Options:
  --help  Show this message and exit.

Commands:
  binoculars  PAN'25 baseline: Binoculars.
  ppmd        PAN'25 baseline: Compression-based cosine.
  tfidf       PAN'25 baseline: TF-IDF SVM.

More information on how to install and run the baselines can be found in the README on GitHub.
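
To illustrate the kind of supervised approach the TF-IDF SVM baseline represents, here is a minimal scikit-learn sketch. It is not the published baseline code; the hyperparameters, file names, and calibration step are assumptions for illustration only:

import json

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# "train.jsonl" and "dataset.jsonl" are placeholder file names.
train = load_jsonl("train.jsonl")
test = load_jsonl("dataset.jsonl")

# Linear SVM over TF-IDF word n-gram features; calibration turns the SVM's
# decision values into class probabilities usable as confidence scores.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)),
    ("svm", CalibratedClassifierCV(LinearSVC())),
])
model.fit([r["text"] for r in train], [r["label"] for r in train])

with open("predictions.jsonl", "w", encoding="utf-8") as fout:
    for r in test:
        score = model.predict_proba([r["text"]])[0][1]  # P(machine-authored)
        fout.write(json.dumps({"id": r["id"], "label": round(float(score), 4)}) + "\n")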

Leaderboard

TBD

Task Committee