Evaluation Data

Data are a key ingredient of evaluation. For PAN's shared tasks on digital text forensics, a number of datasets have been compiled and used to evaluate dozens of approaches. Using these datasets in your research makes your results comparable with those of previously evaluated approaches. You are welcome to submit datasets of your own making to PAN's shared tasks.

Author Identification

Authorship Attribution

Datasets that comprise texts of questioned authorship and texts from candidate authors.

PAN11: training, test, joint dataset, bib

PAN12: training, test, joint dataset, bib

C10: joint dataset, bib

C50: joint dataset, bib

Authorship Verification

Datasets that comprise sets of documents, one of which is of questioned authorship: it may or may not be from the same author as the others.

PAN13: training, test, bib

PAN14: training, test, bib

PAN15: training, test, bib

Author Clustering

Datasets are being prepared for PAN @ CLEF 2016.

Author Linking

Datasets are being prepared for PAN @ CLEF 2016.

Author Diarization

Datasets are being prepared for PAN @ CLEF 2016.

Author Profiling

Gender Prediction

Datasets that comprise sets of texts from the same author, annotated with the author's gender.

PAN13: training, test, bib

PAN14: training, test, bib

PAN15: training, test, bib

Age Prediction

Datasets that comprise sets of texts from the same author, annotated with the author's age.

PAN13: training, test, bib

PAN14: training, test, bib

PAN15: training, test, bib

Personality Prediction

Datasets that comprise sets of texts from the same author, annotated with the author's Big Five personality traits.

PAN15: training, test, bib

Author Obfuscation

Author Masking

Datasets are being prepared for PAN @ CLEF 2016.

Author Imitation

Datasets are being prepared for PAN @ CLEF 2016.

Text Reuse Detection (aka Plagiarism Detection)

Source Retrieval

Datasets that comprise documents with text reused from the ClueWeb.

PAN12: training, test, bib

PAN13: training, test, bib

PAN14: training, test, bib

PAN15: training, test, bib

Text Alignment

Datasets that comprise pairs of documents that may share reused text.

PAN12: training, test, bib

PAN13: training, test, bib

PAN14: training, test, supplemental test, bib

Alvi15: training, test, bib

Asghari15: training, test, bib

Cheema15: training, test, bib

Hanif15: training, test, bib

Khoshnavataher15: training, test, bib

Kong15: training, test, bib

Mohtaij15: training, test, bib

Palkovskii15: training, test, bib

Intrinsic Plagiarism Detection

Datasets that comprise documents with text reuse.

PAN09: training, test, bib

PAN11: training, test, bib

Cross-language Text Reuse Detection

External Plagiarism Detection

Datasets that comprise documents with text reuse and large collections of potential source documents.

PAN09: training, test, bib

PAN10: training, test, bib

PAN11: training, test, bib

Source Code Reuse Detection

Credibility Analysis

Wikipedia Vandalism Detection

Datasets that comprise samples of Wikipedia edits, annotated as to whether or not they are vandalism.

PAN10: training, test, bib

PAN11: training, test, bib

Wikipedia Quality Flaw Prediction

Datasets that comprise samples of Wikipedia articles and annotations with regard to quality flaws.

PAN12: training, test, bib

Sexual Predator Identification

Datasets that comprise conversations with sexual content, where one party pretends to be a minor.

PAN12: training, test, bib

Submit Data

Got a new dataset for one of PAN's tasks? We welcome the submission of new datasets to all tasks. As long as the dataset is formatted the same way as the other datasets for the same task, all software submitted by previous participants on that task can be run against it.
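
Before submitting, it may help to run a rough check that a new dataset is laid out like an existing one for the same task. The following is only a sketch under stated assumptions: it compares which file extensions occur at which directory depth, and it does not validate file contents or any task-specific format. The function names and the notion of a "layout signature" are hypothetical and not part of PAN's tooling.

```python
from collections import Counter
from pathlib import Path


def layout_signature(root):
    """Count files per (directory depth, file extension) pair below root.

    Hypothetical helper: a coarse fingerprint of how a dataset is organized.
    """
    root = Path(root)
    signature = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Depth 1 means the file sits directly inside root.
            depth = len(path.relative_to(root).parts)
            signature[(depth, path.suffix.lower())] += 1
    return signature


def same_layout(reference_dir, candidate_dir):
    """Return True if both datasets use the same (depth, extension) pairs.

    File counts are ignored on purpose, since datasets differ in size.
    """
    reference = set(layout_signature(reference_dir))
    candidate = set(layout_signature(candidate_dir))
    return reference == candidate
```

Calling `same_layout("existing_dataset", "my_dataset")` with two local directories (placeholder names) returns `False` if, for example, the new dataset stores its texts one level deeper or uses a different file extension. A `True` result is only a necessary condition for compatibility, not a guarantee.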