Authorship Verification 2020

Synopsis

Task: Given two texts, determine if they are written by the same author.
Input: Stories crawled from fanfiction.net; 53k text pairs total; small and large training set [data]
Evaluation: AUC, F1, c@1, F_0.5u [code]
Submission: Deployment on TIRA [submit]
Baseline: TFIDF-weighted char 4-grams cosine similarity; compression method calculating cross-entropy [code]

Task

Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles.

Over the coming three years, from PAN 2020 to PAN 2022, we will develop a new experimental setup that addresses three key questions in authorship verification that have not been studied at scale to date:

Year 1 (PAN 2020): Closed-set verification.
Given a large training dataset comprising known authors who have written about a given set of topics, the test dataset contains verification cases from a subset of the authors and topics found in the training data.
Year 2 (PAN 2021): Open-set verification.
Given the training dataset of Year 1, the test dataset contains verification cases from previously unseen authors and topics.
Year 3 (PAN 2022): Surprise task.
The task of the last year of this evaluation cycle (to be announced at a later time) will be designed with an eye on realism and practical application.

This evaluation cycle on authorship verification provides for a renewed challenge of increasing difficulty within a large-scale evaluation. We invite you to plan ahead and participate in all three of these tasks.

Data

The train (calibration) and test datasets consist of pairs of (snippets from) two different fanfics that were drawn from fanfiction.net. Each pair was assigned a unique identifier, and we distinguish between same-author pairs and different-authors pairs. Additionally, we offer metadata on the fandom (i.e. thematic category) for each text in the pair (note that fanfic "crossovers" were not included and only single-fandom texts were considered). The fandom distribution in these datasets approximates the (long-tail) distribution of the fandoms in the original dataset as closely as possible. The test dataset is structured in exactly the same way, but participants should expect a significant shift in the relation between authors and fandoms.

The training dataset comes in two variants: a smaller dataset, particularly suited for symbolic machine learning methods, and a large dataset, suitable for applying data-hungry deep learning algorithms. Participants have to specify which of the two datasets was used to train their model. Models using the small set will be evaluated separately from models using the large set. We encourage participants to try the small dataset as a challenge, though participants can submit separate submissions for either one or both.

Both the small and the large dataset come with two newline delimited JSON files each (*.jsonl). The first file contains pairs of texts (each pair has a unique ID) and their fandom labels:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "fandoms": ["Fandom 1", "Fandom 2"], "pair": ["Text 1...", "Text 2..."]}
{"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "fandoms": ["Fandom 3", "Fandom 4"], "pair": ["Text 3...", "Text 4..."]}
...

The second file, ending in *_truth.jsonl, contains the ground truth for all pairs. The ground truth is composed of a boolean flag indicating if texts in a pair are from the same author and the numeric author IDs:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "same": true, "authors": ["1446633", "1446633"]}
{"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "same": false, "authors": ["1535385", "1998978"]}
...

Data and ground truth are in the same order and can be ingested line-wise in parallel without the need for a reshuffle based on the pair ID. The fandom labels will be given in both the training and testing datasets. The ground truth file will only be available for the training data.
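Because the two files are aligned line by line, they can be read together in a single pass. The following is a minimal sketch of that idea; the function name `iter_labelled_pairs` and the file paths passed to it are illustrative assumptions, not part of the released code.

```python
import json
from typing import Iterator, Tuple


def iter_labelled_pairs(pairs_path: str, truth_path: str) -> Iterator[Tuple[dict, dict]]:
    """Yield (pair, truth) records from the two JSONL files, one line at a time."""
    with open(pairs_path, encoding="utf-8") as pairs_file, \
            open(truth_path, encoding="utf-8") as truth_file:
        for pair_line, truth_line in zip(pairs_file, truth_file):
            pair = json.loads(pair_line)
            truth = json.loads(truth_line)
            # The files are documented to share one order; check it anyway.
            if pair["id"] != truth["id"]:
                raise ValueError(f"ID mismatch: {pair['id']} != {truth['id']}")
            yield pair, truth
```

Each yielded `pair` carries the `fandoms` and `pair` fields, and each `truth` carries `same` and `authors`, so a training loop can consume both without building an index on the pair ID.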

Baselines [code]

We provide the following baseline methods:

A simple method that calculates the cosine similarities between TFIDF-normalized, bag-of-character-tetragrams representations of the texts in a pair. The resulting scores are then shifted using a simple grid search, to arrive at an optimal performance on the calibration data.
A simple method based on text compression that given a pair of texts calculates the cross-entropy of text2 using the Prediction by Partial Matching model of text1 and vice-versa. Then, the mean and absolute difference of the two cross-entropies are used by a logistic regression model to estimate a verification score in [0,1].

Note that the above baseline methods do not make use of the fandom information: participants are highly encouraged to exploit this useful metadata in their submissions.
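To make the first baseline more concrete, the sketch below computes a cosine similarity between TFIDF-weighted character tetragram vectors using only the Python standard library. It is a simplified stand-in and not the released baseline code: the function names, the smoothed IDF formula, and the choice to ignore n-grams unseen during fitting are all assumptions, and the grid-search step that shifts the scores is left out.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams (tetragrams by default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def fit_idf(texts: List[str], n: int = 4) -> Dict[str, float]:
    """Estimate a smoothed inverse document frequency on calibration texts."""
    doc_freq: Counter = Counter()
    for text in texts:
        doc_freq.update(set(char_ngrams(text, n)))
    total = len(texts)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_cosine(text1: str, text2: str, idf: Dict[str, float], n: int = 4) -> float:
    """Cosine similarity of the TFIDF-weighted n-gram vectors of two texts."""
    # N-grams that were not seen while fitting receive weight 0 (assumption).
    v1 = {g: c * idf.get(g, 0.0) for g, c in char_ngrams(text1, n).items()}
    v2 = {g: c * idf.get(g, 0.0) for g, c in char_ngrams(text2, n).items()}
    dot = sum(w * v2.get(g, 0.0) for g, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Since all weights are non-negative, the similarity lies in [0, 1]; turning it into a calibrated verification score would still require a thresholding or shifting step such as the grid search described above.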

Evaluation [code]

Systems will be compared and ranked on the basis of four complementary metrics:

AUC: the conventional area-under-the-curve score, as implemented in scikit-learn.
F1-score: the well-known performance measure (not taking into account non-answers), as implemented in scikit-learn.
c@1: a variant of the conventional accuracy measure, which rewards systems that leave difficult problems unanswered (i.e. scores of exactly 0.5), introduced by Peñas and Rodrigo (2011).
F_0.5u: a newly proposed measure that puts more emphasis on deciding same-author cases correctly (Bevendorff et al. 2019).
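The two less common metrics in this list can be written out in a few lines. The sketch below follows the definitions as we understand them from the cited papers: c@1 = (n_c + n_u * n_c / n) / n, and F_0.5u as an F_0.5 score in which unanswered problems count as false negatives. The function names are illustrative, and the reference evaluation script remains authoritative.

```python
from typing import Sequence


def c_at_1(truth: Sequence[bool], scores: Sequence[float]) -> float:
    """c@1: correct answers plus partial credit for non-answers (score == 0.5)."""
    n = len(scores)
    n_correct = sum(1 for t, s in zip(truth, scores) if s != 0.5 and (s > 0.5) == t)
    n_unanswered = sum(1 for s in scores if s == 0.5)
    return (n_correct + n_unanswered * n_correct / n) / n


def f_05_u(truth: Sequence[bool], scores: Sequence[float]) -> float:
    """F_0.5u: F_0.5 where non-answers are treated like false negatives."""
    n_tp = n_fp = n_fn = n_u = 0
    for t, s in zip(truth, scores):
        if s == 0.5:
            n_u += 1
        elif s > 0.5 and t:
            n_tp += 1
        elif s > 0.5 and not t:
            n_fp += 1
        elif t:
            n_fn += 1  # predicted different-author, but truly same-author
    denominator = 1.25 * n_tp + 0.25 * (n_fn + n_u) + n_fp
    return 1.25 * n_tp / denominator if denominator else 0.0
```

For example, with ground truth `[True, True, False, False]` and scores `[0.9, 0.5, 0.2, 0.7]`, two of the three answered problems are correct and one is unanswered, so c@1 is (2 + 1 * 2 / 4) / 4 = 0.625 and F_0.5u is 1.25 / (1.25 + 0.25 + 1) = 0.5.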

A reference evaluation script is made available. Systems will be applied to a pairs-file and are expected to produce a single jsonl-file as result. In this prediction file, each separate line should contain a valid json-string that provides the ID of a pair in the pairs-file and a "value" field, with a floating-point score that is bounded (0 <= score <= 1), indicating the probability that this pair is a same-author text pair. Systems are allowed to leave some problems unanswered: in such cases, the answer can be left out from the prediction file OR its value should be set to exactly 0.5. All answers that have a value of 0.5 will be considered non-decisions.
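The rules for the prediction file can be checked before submission. The following sketch is one possible reading of them and is not the reference evaluation script; the function name `load_scores` is a hypothetical helper. It defaults every pair to a non-decision (0.5), then overwrites that default with each answer found in the file after checking that the score lies in [0, 1].

```python
import json
from typing import Dict, Iterable, List


def load_scores(prediction_lines: Iterable[str], pair_ids: List[str]) -> Dict[str, float]:
    """Map each pair ID to its score; missing answers become non-decisions."""
    scores = {pair_id: 0.5 for pair_id in pair_ids}
    for line in prediction_lines:
        if not line.strip():
            continue  # ignore empty lines
        answer = json.loads(line)
        value = float(answer["value"])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"Score out of bounds for {answer['id']}: {value}")
        if answer["id"] not in scores:
            raise ValueError(f"Unknown pair ID: {answer['id']}")
        scores[answer["id"]] = value
    return scores
```

Passing an open file object as `prediction_lines` works as well, since files iterate line by line.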

Submission

This year, each participating team is allowed a maximum of two submissions: one submission trained on the small calibration dataset and one trained on the large calibration dataset. (Participants can also choose to submit just one system.)

Contrary to previous editions, submissions are not expected to be trained on TIRA (i.e. there will be no new calibration dataset for the testing phase). The submissions should therefore contain already fully calibrated models that should only be deployed on TIRA for the actual testing.

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the EVALUATION-DIRECTORY and (ii) an absolute path to an OUTPUT-DIRECTORY:

 > mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY

Within EVALUATION-DIRECTORY, a single jsonl-formatted file will be included (pairs.jsonl), containing the text pairs (analogously to the calibration data that was released). The answers should be written to a jsonl-formatted file (answers.jsonl) under OUTPUT-DIRECTORY. Each line should contain a single json-formatted answer, using the following syntax:

{"id": "c04fdf1e-ddf5-5542-96e7-13ce18cae176", "value": 0.4921}
{"id": "49dc4cae-3d32-5b4d-b240-a080a1dbb659", "value": 0.5}
{"id": "f326fe7c-fc10-566f-a70f-0f36e3f92399", "value": 0.5}
{"id": "16daa0d1-61b8-5650-b7ee-5e265bd40910", "value": 0.9333}
{"id": "08b536a8-4fed-5f62-97bb-e57f79e841d2", "value": 0.0751}
...

Note: Each verification problem should be solved independently of other problems in the collection.
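A command-line wrapper matching the call shown above could look like the following sketch. It reads pairs.jsonl from the directory given with -i and writes answers.jsonl to the directory given with -o, scoring each pair independently. The names `run`, `main`, and `placeholder_score` are hypothetical; `placeholder_score` simply returns a non-decision and stands in for a calibrated model.

```python
import argparse
import json
import os
from typing import Callable, List, Optional


def placeholder_score(text1: str, text2: str) -> float:
    """Hypothetical stand-in for a trained model: always a non-decision."""
    return 0.5


def run(input_dir: str, output_dir: str,
        score_pair: Callable[[str, str], float]) -> None:
    """Score every pair in pairs.jsonl and write one answer per line."""
    os.makedirs(output_dir, exist_ok=True)
    in_path = os.path.join(input_dir, "pairs.jsonl")
    out_path = os.path.join(output_dir, "answers.jsonl")
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            pair = json.loads(line)
            text1, text2 = pair["pair"]
            # Each problem is scored on its own, without looking at other pairs.
            value = float(score_pair(text1, text2))
            dst.write(json.dumps({"id": pair["id"], "value": value}) + "\n")


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Authorship verification runner")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="absolute path to EVALUATION-DIRECTORY")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="absolute path to OUTPUT-DIRECTORY")
    args = parser.parse_args(argv)
    run(args.input_dir, args.output_dir, placeholder_score)
```

To use this as a script, call `main()` from the entry point and replace `placeholder_score` with the function that applies the trained model.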

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide linked above.

Once deployed in your virtual machine, we ask you to access TIRA, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Results

The final ranking of the submitted approaches was as follows:

Rank	Team	Training dataset	AUC	c@1	F0.5u	F1-score	Overall
1	boenninghoff20	large	0.969	0.928	0.907	0.936	0.935
2	weerasinghe20	large	0.953	0.880	0.882	0.891	0.902
3	boenninghoff20	small	0.940	0.889	0.853	0.906	0.897
4	weerasinghe20	small	0.939	0.833	0.817	0.860	0.862
5	halvani20b	small	0.878	0.796	0.819	0.807	0.825
6	kipnis20	small	0.866	0.801	0.815	0.809	0.823
7	araujo20	small	0.874	0.770	0.762	0.811	0.804
8	niven20	small	0.795	0.786	0.842	0.778	0.800
9	gagala20	small	0.786	0.786	0.809	0.800	0.796
10	araujo20	large	0.859	0.751	0.745	0.800	0.789
11	baseline (naive)	small	0.780	0.723	0.716	0.767	0.747
12	baseline (compression)	small	0.778	0.719	0.703	0.770	0.742
13	ordonez20	large	0.696	0.640	0.655	0.748	0.685
14	faber20	small	0.293	0.331	0.314	0.262	0.300
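This page does not state how the Overall column is computed. One assumption that reproduces every listed value to within 0.001 is the unweighted mean of the four metrics; the sketch below shows that calculation, with `overall_score` as an illustrative name.

```python
from statistics import mean


def overall_score(auc: float, c_at_1: float, f_05_u: float, f1: float) -> float:
    """Unweighted mean of the four metrics (assumed definition of Overall)."""
    return mean([auc, c_at_1, f_05_u, f1])
```

For the first row, (0.969 + 0.928 + 0.907 + 0.936) / 4 = 0.935, which matches the table.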

A detailed comparison of the approaches will be provided in the task overview paper.

Related Work

Cross-domain Author identification task at PAN@CLEF'18 (closed-set cross-domain authorship attribution in fanfiction)
Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, March 2008.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, January 2009.
Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, March 2009.
A. Peñas and A. Rodrigo. A Simple Measure to Assess Nonresponse. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pages 1415-1424, 2011.
Bevendorff et al. Generalizing Unmasking for Short Texts. In Proceedings of NAACL, pages 654-659, 2019.