Authorship Verification 2023

Synopsis
Task
Data
Evaluation
Related Work
Task Committee

Synopsis

Task: Given two texts belonging to different Discourse Types (DT), determine if they are written by the same author (cross-DT authorship verification).
Input: Pairs of texts from the Aston 100 Idiolects Corpus in English covering DTs of both written and spoken language : essays, emails, interviews, and speech transcriptions. [data]
Evaluation: AUC, F1, c@1, F_0.5u, Brier [code]
Submission: Deployment on TIRA [submit]
Baseline: TFIDF-weighted char n-grams cosine similarity; compression method calculating cross-entropy [code]

Task

Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles. In previous editions of PAN, we explored the effectiveness of authorship verification technology in several languages and text genres. In the 2020 and 2021 editions, cross-domain authorship verification using fanfiction texts was examined. Despite certain differences between fandoms, the task of cross-fandom authorship verification has proved to be relatively feasible. In the 2022 edition, we introduced more challenging scenarios where each author verification case considers two texts that belong to different DTs ( cross-DT authorship verification ). However, all considered DTs correspend to written language. In the current edition, for the first time, we will focus on (cross-discourse type) authorship verification where both written language (i.e., essays and emails) and spoken language (i.e., interviews and speech transcriptions) are represented in the set of discourse types. This will provide the opportunity to study the robustness and effectiveness of stylometric approaches in challenging and intriguing conditions. In addition, the ability of authorship verification methods to handle the different forms of expression in written and oral language will be highlighted.

Based on a new corpus in English, we provide cross-DT authorship verification cases using the following DTs:

Essays (written discourse)
Emails (written discourse)
Interviews (spoken discourse)
Speech transcriptions (spoken discourse)

The corpus comprises texts of around 100 individuals. All individuals have similar age (18-22) and are native English speakers. The topic of text samples is not restricted while the level of formality can vary within a certain DT (e.g., text messages may be addressed to family members or non-familial acquaintances).

Data

Download instructions: Please request access to the dataset by following the 'Request Item' link at the bottom of the FoLD webpage. You will need to set up an account on FoLD, a process which takes less than a minute. Once you are logged in, ignore the information on 'How to make a submission' and go to the Data link instead. When filling in the data request form, please put 'PAN 2023' in the Title field, and 'PAN 2023 Authorship Verification Task' as Description of proposed use of the data. Select 'No' for the Ethical approval option. In the Data Protection section, please put 'PAN 2023 Competition' for the Dissemination and Other information fields and if you do not have a link to your institution's data protection policy, simply put your institution's landing page URL

Dataset description:

The train (calibration) and test datasets consist of pairs of texts belonging to two different DTs. Each pair was assigned a unique identifier and we distinguish between same-author pairs and different-authors pairs. Additionally, we offer metadata on the discourse type for each text in the pair. Both training and test datasets are structured in the same way and have similar properties. However, their author sets are not overlapping.

Since the text length of texts of emails and interviews can be very small, each text belonging to these DTs is actually a concatenation of different messages. We use the <new> tag to denote boundaries of original messages. New lines within a text are denoted with the <nl>tag. In addition, author-specific and topic-specific information, e.g. named entities, has been replaced with corresponding tags. In spoken discourse types, additional tags are used to indicate nonverbal vocalisations (e.g, cough, laugh).

The training dataset comes with two newline delimited JSON files encoded in UTF8. The first file pairs.jsonl contains pairs of texts (each pair has a unique ID) and their discourse type labels:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "discourse_type": ["essay", "email"], "pair": ["Text 1...", "Text 2..."]}
                        {"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "discourse_type": ["essay", "interview"], "pair": ["Text 3...", "Text 4..."]}
...

The second file, truth.jsonl, contains the ground truth for all pairs. The ground truth is composed of a boolean flag indicating if texts in a pair are from the same author and the author IDs:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "same": true, "authors": ["1446633", "1446633"]}
                        {"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "same": false, "authors": ["1535385", "1998978"]}
...

Data and ground truth are in the same order and can be ingested line-wise in parallel without the need for a reshuffle based on the pair ID. The discourse type labels will be given in both the training and testing datasets. The ground truth file will only be available for the training data.

The development of the 100 Idiolects corpus was supported by a Research England Expanding Excellence in England (E3) grant awarded to the Aston Institute for Forensic Linguistics. This dataset should not to be used for any other research purposes without permission from the Aston Institute for Forensic Linguistics.

Baselines [code]

We provide the following baseline methods:

A simple method that calculates the cosine similarities between TFIDF-normalized, bag-of-character-tetragrams representations of the texts in a pair. The resulting scores are then shifted using a simple grid search, to arrive at an optimal performance on the calibration data.
A simple method based on text compression that given a pair of texts calculates the cross-entropy of text2 using the Prediction by Partial Matching model of text1 and vice-versa. Then, the mean and absolute difference of the two cross-entropies are used by a logistic regression model to estimate a verification score in [0,1].

Note that the above baseline methods do not make use of the available discourse type information: participants are highly encouraged to exploit this useful metadata in their submissions.

Evaluation [code]

Systems will be compared and ranked on the basis of five, complementary metrics:

AUC: the conventional area-under-the-ROC-curve score, as implemented in scikit-learn.
F1-score: the well-known performance measure (not taking into account non-answers), as implemented in scikit-learn.
c@1: a variant of the conventional F1-score, which rewards systems that leave difficult problems unanswered (i.e. scores of exactly 0.5), introduced by Peñas and Rodrigo (2011).
F_0.5u: a measure that puts more emphasis on deciding same-author cases correctly (Bevendorff et al. 2019)
Brier: the complement of the well-known Brier score, for evaluating the goodness of (binary) probabilistic classififiers, as implemented in scikit-learn.

These measures are complementary and provide a holistic way to assess a system's performance: c@1 measures the accuracy of binary predictions but also the ability of systems to leave difficult cases unanswered; AUC measures the ability of systems to assign higher scores to positive cases in comparison to negative cases; Brier measures the ability of systems to calibrate the verification scores as probability of a positive answer, etc.

A reference evaluation script is made available [code] . Systems will be applied to a pairs-file and are expected to produce a single jsonl-file as result. In this prediction file, each separate line should contain a valid json-string that provides the ID of a pair in the pairs-file and a "value" field, with a floating-point score that is bounded (0 >= score <= 1), indicating the probability that this pair is a same-author text pair. Systems are allowed to leave some problems unanswered: in such cases, the answer can be left out from the prediction file OR its value should be set to exactly 0.5. All answers that have a value of 0.5 will be considered non-decisions, which will affect some (but not all) evaluation metrics.

Submission

Each participating team is allowed the submission of at most three runs. The participants should indicate in their notebook paper the submitted system settings for each run.

Submissions are not expected to be trained on TIRA (i.e. there will be no new calibration dataset for the testing phase). The submisisons should therefore contain already fully calibrated models that should only be deployed on TIRA for the actual testing.

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the EVALUATION-DIRECTORY and (ii) an absolute path to an OUTPUT-DIRECTORY:

 > mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY

Within EVALUATION-DIRECTORY, a single jsonl-formatted file will be included (pairs.jsonl), containing the text pairs (analogously to the calibration data that was released). The answers should be written to a jsonl-formatted file (answers.jsonl) under OUTPUT-DIRECTORY. Each line should contain a single json-formatted answer, using the following syntax:

{"id": "c04fdf1e-ddf5-5542-96e7-13ce18cae176", "value": 0.4921}
                    {"id": "49dc4cae-3d32-5b4d-b240-a080a1dbb659", "value": 0.5}
                    {"id": "f326fe7c-fc10-566f-a70f-0f36e3f92399", "value": 0.5}
                    {"id": "16daa0d1-61b8-5650-b7ee-5e265bd40910", "value": 0.9333}
                    {"id": "08b536a8-4fed-5f62-97bb-e57f79e841d2", "value": 0.0751}
...

Note: Each verification problem should be solved independently of other problems in the collection.

We ask you to prepare your software in a docker container and submit it to TIRA, where you can self-evaluate your software on the test data. See the TIRA Guides for mor details on the submission procedure.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Efstathios Stamatatos, Mike Kestemont, Krzysztof Kredens, Piotr Pezik, Annina Heini, Janek Bevendorff, Benno Stein, and Martin Potthast. Overview of the Authorship Verification Task at PAN 2022. In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and Martin Potthast, editors, CLEF 2023 Labs and Workshops, Notebook Papers, September 2023.
Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Janek Bevendorff, Martin Potthast, and Benno Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2021. Working notes of CLEF 2021 - Conference and Labs of the Evaluation Forum.
Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast and Benno Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2020. Working notes of CLEF 2020 - Conference and Labs of the Evaluation Forum.
Efstathios Stamatatos. Authorship Verification: A Review of Recent Advances. Research in Computer Science, 123, pp. 9-25, 2016.
Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, 2008.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, 2009.
Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, 2009.
A. Peñas and A. Rodrigo. A Simple Measure to Assess Nonresponse. In Proc. of the 49th Annual Meeting of the Association for. Computational Linguistics, Vol. 1, pages 1415-1424, 2011.
Bevendorff et al. Generalizing Unmasking for Short Texts, Proceedings of NAACL, pages 654-659, 2019.
Kredens, K., Heini, A. and Pezik, P. 100 Idiolects - a corpus for research on individual variation across discourse types. Aston University: FoLD Repository, 2021.