Evaluation Data

Data are a key ingredient of evaluation. For PAN's shared tasks on digital text forensics, a number of datasets have been compiled and used to evaluate dozens of approaches. Using these datasets in your research ensures comparability with previously evaluated approaches. You are welcome to submit datasets of your own making to PAN's shared tasks.

Authorship (14 datasets)

Author Identification
Name Downloads Year Size [bytes] Size Task Extras
C50-Attribution [joint] Authorship Attribution
C10-Attribution [joint] Authorship Attribution
PAN12-Attribution [training] [test] [joint] 2012 Authorship Attribution [bib]
PAN11-Attribution [training] [test] [joint] 2011 Authorship Attribution [bib]
PAN15-Verification [training] [test] 2015 Authorship Verification [bib]
PAN14-Verification [training] [test] 2014 Authorship Verification [bib]
PAN13-Verification [training] [test] 2013 Authorship Verification [bib]
PAN17-Clustering [training] [test] 2017 Author Clustering [bib]
PAN16-Clustering [training] [test] 2016 Author Clustering [bib]
Author Profiling
Name Downloads Year Size [bytes] Size Task Extras
PAN15-Gender-Prediction [training] 2015 Gender Prediction [bib]
PAN14-Gender-Prediction [training] 2014 Gender Prediction [bib]
PAN13-Gender-Prediction [training] [test] 2013 Gender Prediction [bib]
PAN15-Age-Prediction [training] 2015 Age Prediction [bib]
PAN14-Age-Prediction [training] 2014 Age Prediction [bib]
PAN13-Age-Prediction [training] [test] 2013 Age Prediction [bib]
PAN15-Personality-Prediction [training] 2015 Personality Prediction [bib]

Trust (5 datasets)

Credibility Analysis
Name Downloads Year Size [bytes] Size Task Extras
PAN11-Wikipedia-Vandalism [training] [test] 2011 Wikipedia Vandalism Detection [bib]
PAN10-Wikipedia-Vandalism [training] [test] 2010 Wikipedia Vandalism Detection [bib]
PAN12-Wikipedia-Quality-Flaw-Prediction [training] [test] 2012 Wikipedia Quality Flaw Prediction [bib]
PAN-SemEval-Hyperpartisan-News-Detection-19 [training] 2018 1 GB 751K Hyperpartisan News Detection [bib]
Deception Detection
Name Downloads Year Size [bytes] Size Task Extras
PAN12-Deception-Detection [training] [test] 2012 Deception Detection [bib]

Originality (20 datasets)

Text Reuse Detection (aka Plagiarism Detection)
Name Downloads Year Size [bytes] Size Task Extras
PAN12-Source-Retrieval [training] [test] 2012 Source Retrieval [bib]
PAN13-Source-Retrieval [training] [test] 2013 Source Retrieval [bib]
PAN14-Source-Retrieval [training] 2014 Source Retrieval [bib]
PAN15-Source-Retrieval [training] 2015 Source Retrieval [bib]
PAN12-Text-Alignment [training] [test] 2012 Text Alignment [bib]
PAN13-Text-Alignment [training] [test] 2013 Text Alignment [bib]
PAN14-Text-Alignment [training] [test] [supplemental test] 2014 Text Alignment [bib]
Alvi15-Text-Alignment-en-fa [training] 2015 Text Alignment [bib]
Cheema15-Text-Alignment-en [training] 2015 Text Alignment [bib]
Hanfi15-Text-Alignment-en-ur [training] 2015 Text Alignment [bib]
Khoshnavataher15-Text-Alignment-fa [training] 2015 Text Alignment [bib]
Kong15-Text-Alignment-zh [training] 2015 Text Alignment [bib]
Mohtaij15-Text-Alignment-en [training] 2015 Text Alignment [bib]
Palkovskii15-Text-Alignment-en [training] 2015 Text Alignment [bib]
PAN09-Intrinsic-Plagiarism-Detection [training] [test] 2009 Intrinsic Plagiarism Detection [bib]
PAN11-Intrinsic-Plagiarism-Detection [training] [test] 2011 Intrinsic Plagiarism Detection [bib]
PAN09-External-Plagiarism-Detection [training] [test] 2009 External Plagiarism Detection [bib]
PAN10-External-Plagiarism-Detection [training] [test] 2010 External Plagiarism Detection [bib]
PAN11-External-Plagiarism-Detection [training] [test] 2011 External Plagiarism Detection [bib]

Submit Data

Got a new dataset for one of PAN's tasks? We welcome the submission of new datasets to all tasks. As long as a dataset is formatted the same way as the existing datasets for the same task, all software submitted by previous participants in that task can be run against it.
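
Before submitting, a rough sanity check of the "formatted the same way" requirement can be scripted. The sketch below is only an illustration under stated assumptions: it assumes both datasets are unpacked into local directories, and it compares nothing more than which kinds of files (by suffix) appear at which directory depth. The function names `layout_signature` and `layout_mismatches` are hypothetical, and the check does not replace the format description of the respective task.

```python
from collections import Counter
from pathlib import Path


def layout_signature(root):
    """Count files per (directory depth, file suffix) below a dataset root.

    Assumption: the dataset is an unpacked directory tree on local disk.
    """
    root = Path(root)
    signature = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Depth 1 means the file sits directly in the dataset root.
            depth = len(path.relative_to(root).parts)
            signature[(depth, path.suffix.lower())] += 1
    return signature


def layout_mismatches(reference_root, candidate_root):
    """Report (depth, suffix) kinds that occur in only one of the two datasets."""
    reference = set(layout_signature(reference_root))
    candidate = set(layout_signature(candidate_root))
    return {
        "missing_in_candidate": sorted(reference - candidate),
        "unexpected_in_candidate": sorted(candidate - reference),
    }
```

An empty list under both keys only means the two directory layouts look alike at this coarse level; file contents, such as the structure of ground-truth files, still need to be compared against an existing dataset for the task.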
