Synopsis

  • Task: Given two texts, determine if they are written by the same author.
  • Input: Stories crawled from fanfiction.net; 53k text pairs total; small and large training set [data]
  • Evaluation: AUC, F1, c@1, F_0.5u, Brier [code]
  • Submission: Deployment on TIRA [submit]
  • Baseline: Cosine similarity of TFIDF-weighted character tetragrams (4-grams); compression method calculating cross-entropy [code]

Task

Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles.

With PAN 2020, we started to develop a new experimental setup that addresses three key questions in authorship verification that have not been studied at scale to date:

  • Year 1 (PAN 2020) [last year]: Closed-set verification.
    Given a large training dataset comprising texts by known authors who have written about a given set of topics, the test dataset contains verification cases from a subset of the authors and topics found in the training data.

  • Year 2 (PAN 2021) [this year]: Open-set verification.
    Given the training dataset of Year 1, the test dataset contains verification cases from previously unseen authors and topics.

  • Year 3 (PAN 2022) [next year]: Surprise task.
    The task of the last year of this evaluation cycle (to be announced at a later time) will be designed with an eye on realism and practical application.

This evaluation cycle on authorship verification provides for a renewed challenge of increasing difficulty within a large-scale evaluation. We invite you to plan ahead and participate in all three of these tasks.

***new***: At PAN 2021, the training data is the same as last year, but we proceed to the announced open-set verification scenario and considerably ramp up the difficulty of the task by providing a new test set consisting entirely of unseen authors and topics.

Data

The training (calibration) and test datasets consist of pairs of (snippets from) two different fanfics drawn from fanfiction.net. Each pair was assigned a unique identifier, and we distinguish between same-author pairs and different-author pairs. Additionally, we offer metadata on the fandom (i.e. thematic category) of each text in a pair (note that fanfic "crossovers" were not included and only single-fandom texts were considered). The fandom distribution in these datasets closely approximates the (long-tail) distribution of the fandoms in the original dataset. The test dataset is structured in exactly the same way, but participants should expect a significant shift in the relation between authors and fandoms.

The training dataset comes in two variants: a smaller dataset, particularly suited for symbolic machine learning methods, and a large dataset, suitable for data-hungry deep learning algorithms. Participants must specify which of the two datasets was used to train their model. Models using the small set will be evaluated separately from models using the large set. We encourage participants to try the small dataset as a challenge, though they can submit separate systems for either one or both datasets.

Both the small and the large dataset come with two newline-delimited JSON files (*.jsonl) each. The first file contains the text pairs (each with a unique ID) and their fandom labels:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "fandoms": ["Fandom 1", "Fandom 2"], "pair": ["Text 1...", "Text 2..."]}
                        {"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "fandoms": ["Fandom 3", "Fandom 4"], "pair": ["Text 3...", "Text 4..."]}
...

The second file, ending in *_truth.jsonl, contains the ground truth for all pairs. The ground truth is composed of a boolean flag indicating if texts in a pair are from the same author and the numeric author IDs:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "same": true, "authors": ["1446633", "1446633"]}
                        {"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "same": false, "authors": ["1535385", "1998978"]}
...

Data and ground truth files are in the same order and can be ingested line-wise in parallel, without any reshuffling based on the pair ID. The fandom labels are given in both the training and test datasets. The ground truth file is only available for the training data.
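
For illustration, the two files could be read in parallel with a minimal Python sketch like the following. The file names used here are placeholders for the released *.jsonl and *_truth.jsonl files:

import json

# Minimal sketch: read the pairs file and its ground truth line-wise in
# parallel. The file names below are placeholders for the released files.
with open("pairs.jsonl") as pairs_file, open("pairs_truth.jsonl") as truth_file:
    for pair_line, truth_line in zip(pairs_file, truth_file):
        pair = json.loads(pair_line)
        truth = json.loads(truth_line)
        assert pair["id"] == truth["id"]   # both files share the same order
        text1, text2 = pair["pair"]
        fandom1, fandom2 = pair["fandoms"]
        same_author = truth["same"]        # boolean ground-truth label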

Baselines [code]

We provide the following baseline methods:

  • A simple method that calculates the cosine similarity between TFIDF-normalized, bag-of-character-tetragrams representations of the texts in a pair. The resulting scores are then shifted using a simple grid search to arrive at an optimal performance on the calibration data (a minimal sketch of this approach is given after the note below).
  • A simple method based on text compression that, given a pair of texts, calculates the cross-entropy of text2 using the Prediction by Partial Matching (PPM) model of text1 and vice versa. The mean and absolute difference of the two cross-entropies are then used by a logistic regression model to estimate a verification score in [0,1].

Note that the above baseline methods do not make use of the fandom information: participants are highly encouraged to exploit this useful metadata in their submissions.
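
The core of the first baseline can be sketched roughly as follows in Python. This is a simplification, not the official implementation: the official baseline code additionally rescales the scores around a threshold found by grid search on the calibration data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def verification_score(text1, text2):
    """Cosine similarity between TFIDF-weighted character-tetragram vectors."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(4, 4))
    # Fitting the vectorizer per pair is a simplification for illustration.
    vectors = vectorizer.fit_transform([text1, text2])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])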

Evaluation [code]

Systems will be compared and ranked on the basis of five complementary metrics:

  • AUC: the conventional area-under-the-curve score, as implemented in scikit-learn.
  • F1-score: the well-known performance measure (not taking into account non-answers), as implemented in scikit-learn.
  • c@1: a variant of conventional accuracy, which rewards systems that leave difficult problems unanswered (i.e. scores of exactly 0.5), introduced by Peñas and Rodrigo (2011).
  • F_0.5u: a newly proposed measure that puts more emphasis on deciding same-author cases correctly (Bevendorff et al. 2019).
  • Brier: the complement of the well-known Brier score, for evaluating the goodness of (binary) probabilistic classifiers, as implemented in scikit-learn.

These measures are complementary and provide a holistic assessment of a system's performance: c@1 measures the accuracy of binary predictions while also rewarding systems that leave difficult cases unanswered; AUC measures the ability of systems to assign higher scores to positive cases than to negative cases; Brier measures how well the verification scores are calibrated as probabilities of a positive answer.
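
To illustrate how non-decisions (scores of exactly 0.5) enter the computation, the following Python sketch approximates four of the five metrics using scikit-learn and the c@1 definition of Peñas and Rodrigo (2011). It is only an illustration; the reference evaluation script linked above remains authoritative and also covers F_0.5u.

import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, brier_score_loss

def c_at_1(y_true, scores):
    # c@1: unanswered problems (score == 0.5) are credited in proportion
    # to the accuracy achieved on the answered ones.
    n = len(y_true)
    n_correct = sum(1 for t, s in zip(y_true, scores)
                    if s != 0.5 and (s > 0.5) == bool(t))
    n_unanswered = sum(1 for s in scores if s == 0.5)
    return (n_correct + n_unanswered * n_correct / n) / n

def evaluate(y_true, scores):
    y_true = np.asarray(y_true, dtype=int)
    scores = np.asarray(scores, dtype=float)
    answered = scores != 0.5        # non-decisions are excluded from F1
    return {
        "AUC": roc_auc_score(y_true, scores),
        "F1": f1_score(y_true[answered], (scores[answered] > 0.5).astype(int)),
        "c@1": c_at_1(y_true, scores),
        "Brier": 1.0 - brier_score_loss(y_true, scores),  # complement of the Brier score
    }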

A reference evaluation script is made available [code]. Systems will be applied to a pairs-file and are expected to produce a single jsonl-file as a result. In this prediction file, each line should contain a valid json-string that provides the ID of a pair in the pairs-file and a "value" field with a floating-point score bounded to the interval 0 <= score <= 1, indicating the probability that this pair is a same-author text pair. Systems are allowed to leave some problems unanswered: in such cases, the answer can be left out from the prediction file OR its value should be set to exactly 0.5. All answers that have a value of 0.5 will be considered non-decisions, which affects some (but not all) evaluation metrics.

Submission

This year, each participating team is allowed a maximum of two submissions: one submission trained on the small calibration dataset and one trained on the large calibration dataset. (Participants can also choose to submit just one system.)

Contrary to previous editions, submissions are not expected to be trained on TIRA (i.e. there will be no new calibration dataset for the testing phase). The submissions should therefore contain already fully calibrated models that only need to be deployed on TIRA for the actual testing.

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the EVALUATION-DIRECTORY and (ii) an absolute path to an OUTPUT-DIRECTORY:

 > mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY

Within EVALUATION-DIRECTORY, a single jsonl-formatted file will be included (pairs.jsonl), containing the text pairs (analogously to the calibration data that was released). The answers should be written to a jsonl-formatted file (answers.jsonl) under OUTPUT-DIRECTORY. Each line should contain a single json-formatted answer, using the following syntax:

{"id": "c04fdf1e-ddf5-5542-96e7-13ce18cae176", "value": 0.4921}
                    {"id": "49dc4cae-3d32-5b4d-b240-a080a1dbb659", "value": 0.5}
                    {"id": "f326fe7c-fc10-566f-a70f-0f36e3f92399", "value": 0.5}
                    {"id": "16daa0d1-61b8-5650-b7ee-5e265bd40910", "value": 0.9333}
                    {"id": "08b536a8-4fed-5f62-97bb-e57f79e841d2", "value": 0.0751}
...

Note: Each verification problem should be solved independently of other problems in the collection.
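
A minimal skeleton of such a command-line program might look as follows in Python. Here, score_pair is a hypothetical placeholder for your fully calibrated model, and the answers are written to answers.jsonl in the output directory:

import argparse
import json
import os

def score_pair(text1, text2):
    # Hypothetical placeholder: plug in your fully calibrated model here.
    return 0.5

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", dest="input_dir", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    args = parser.parse_args()

    with open(os.path.join(args.input_dir, "pairs.jsonl")) as pairs_file, \
         open(os.path.join(args.output_dir, "answers.jsonl"), "w") as answers_file:
        for line in pairs_file:
            pair = json.loads(line)
            value = score_pair(*pair["pair"])   # each pair is solved independently
            answers_file.write(json.dumps({"id": pair["id"], "value": value}) + "\n")

if __name__ == "__main__":
    main()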

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide linked above.

Once your software is deployed in your virtual machine, we ask you to access TIRA, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Results

A tabular representation of the results for this year's campaign is presented below. We would like to thank all participating teams for their valuable contributions to this endeavour. A more extensive evaluation will be reported in the overview papers that will soon be released:

team                training set   AUC     c@1     F1      F0.5u   Brier   Overall
boenninghoff21      large          0.9869  0.9502  0.9524  0.9378  0.9452  0.9545
embarcaderoruiz21   large          0.9697  0.9306  0.9342  0.9147  0.9305  0.9359
weerasinghe21       large          0.9719  0.9172  0.9159  0.9245  0.9340  0.9327
weerasinghe21       small          0.9666  0.9103  0.9071  0.9270  0.9290  0.9280
menta21             large          0.9635  0.9024  0.8990  0.9186  0.9155  0.9198
peng21              small          0.9172  0.9172  0.9167  0.9200  0.9172  0.9177
embarcaderoruiz21   small          0.9470  0.8982  0.9040  0.8785  0.9072  0.9070
menta21             small          0.9385  0.8662  0.8620  0.8787  0.8762  0.8843
rabinovits21        small          0.8129  0.8129  0.8094  0.8186  0.8129  0.8133
ikae21              small          0.9041  0.7586  0.8145  0.7233  0.8247  0.8050
unmasking21         small          0.8298  0.7707  0.7803  0.7466  0.7904  0.7836
tyo21               large          0.8275  0.7594  0.7911  0.7257  0.8123  0.7832
naive21             small          0.7956  0.7320  0.7856  0.6998  0.7867  0.7600
compressor21        small          0.7896  0.7282  0.7609  0.7027  0.8094  0.7581
futrzynski21        large          0.7982  0.6632  0.8324  0.6682  0.7957  0.7516
liaozhihao21        small          0.4962  0.4962  0.0067  0.0161  0.4962  0.3023

  • Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast and Benno Stein. Overview of the Cross-Domain Authorship Verification Task at PAN 2020. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, 22-25 September, Thessaloniki, Greece (last year's overview paper on closed-set cross-domain authorship verification in fanfiction).
  • Patrick Juola. Authorship Attribution. Foundations and Trends in Information Retrieval, Volume 1, Issue 3, March 2008.
  • Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, January 2009.
  • Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, March 2009.
  • A. Peñas and A. Rodrigo. A Simple Measure to Assess Non-response. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1415-1424, 2011.
  • Bevendorff et al. Generalizing Unmasking for Short Texts. In Proceedings of NAACL 2019, pages 654-659, 2019.

Task Committee