Authorship Verification 2022

Synopsis

  • Task: Given two texts belonging to different Discourse Types (DT), determine if they are written by the same author (cross-DT authorship verification).
  • Input: Pairs of texts from the Aston 100 Idiolects Corpus in English covering the following DTs: essays, emails, text messages, and business memos. [data]
    22,922 document pairs from 112 Authors
  • Evaluation: AUC, F1, c@1, F_0.5u, Brier [code]
  • Submission: Deployment on TIRA [submit]
  • Baseline: TFIDF-weighted char n-grams cosine similarity; compression method calculating cross-entropy [code]
  • Results (7 Participants)
    Best approach: T5-Embeddings of Text and Stylistic Features [Maryam Najafi and Ehsan Tavan*]

Task

Authorship verification is the task of deciding whether two texts have been written by the same author based on comparing the texts' writing styles. In previous editions of PAN, we explored the effectiveness of authorship verification technology in several languages and text genres. In the two most recent editions, cross-domain authorship verification using fanfiction texts was examined. Despite certain differences between fandoms, the task of cross-fandom authorship verification has proved to be relatively feasible. In the current edition, we focus on more challenging scenarios where each author verification case considers two texts that belong to different DTs (cross-DT authorship verification). This will allow us to study the ability of stylometric approaches to capture authorial characteristics that remain stable across DTs even when very different forms of expression are imposed by the DT norms.

Based on a new corpus in English, we provide cross-DT authorship verification cases using the following DTs:

  • Essays
  • Emails
  • Text messages
  • Business memos

The corpus comprises texts of around 100 individuals. All individuals are of a similar age (18-22) and are native English speakers. The topic of the text samples is not restricted, while the level of formality can vary within a given DT (e.g., text messages may be addressed to family members or non-familial acquaintances).

Data [download]

The train (calibration) and test datasets consist of pairs of texts belonging to two different DTs. Each pair was assigned a unique identifier, and we distinguish between same-author and different-author pairs. Additionally, we offer metadata on the discourse type of each text in the pair. Both training and test datasets are structured in the same way, but their author sets do not overlap.

Since individual emails and text messages can be very short, each text belonging to these DTs is actually a concatenation of several messages. We use the <new> tag to denote the boundaries of the original messages. New lines within a text are denoted with the <nl> tag. In addition, author-specific and topic-specific information, e.g. named entities, has been replaced with corresponding tags.
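For illustration, a minimal Python sketch of how these tags might be handled during pre-processing; the literal tag spellings <new> and <nl> are assumed from the description above:

# Sketch: split a concatenated text back into its original messages and
# restore line breaks; assumes the literal tags <new> and <nl>.
def split_messages(text):
    messages = [m.strip() for m in text.split("<new>") if m.strip()]
    return [m.replace("<nl>", "\n") for m in messages]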

The training dataset comes with two newline-delimited JSON files encoded in UTF-8. The first file, pairs.jsonl, contains pairs of texts (each pair has a unique ID) and their discourse type labels:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "discourse_type": ["essay", "email"], "pair": ["Text 1...", "Text 2..."]}
                        {"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "discourse_type": ["email", "text_message"], "pair": ["Text 3...", "Text 4..."]}
...

The second file, truth.jsonl, contains the ground truth for all pairs. The ground truth is composed of a boolean flag indicating if texts in a pair are from the same author and the author IDs:

{"id": "6cced668-6e51-5212-873c-717f2bc91ce6", "same": true, "authors": ["1446633", "1446633"]}
                        {"id": "ae9297e9-2ae5-5e3f-a2ab-ef7c322f2647", "same": false, "authors": ["1535385", "1998978"]}
...

Data and ground truth are in the same order and can be ingested line-wise in parallel without the need for a reshuffle based on the pair ID. The discourse type labels will be given in both the training and testing datasets. The ground truth file will only be available for the training data.
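For illustration, the two files can be read line-wise in parallel, e.g. in Python (a minimal sketch assuming the file names above):

import json

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# pairs.jsonl and truth.jsonl are in the same order, so IDs match line by line.
for pair, truth in zip(read_jsonl("pairs.jsonl"), read_jsonl("truth.jsonl")):
    assert pair["id"] == truth["id"]
    (text1, text2), (dt1, dt2) = pair["pair"], pair["discourse_type"]
    same_author = truth["same"]          # boolean ground-truth label
    author1, author2 = truth["authors"]  # author IDs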

The development of the 100 Idiolects corpus was supported by a Research England Expanding Excellence in England (E3) grant awarded to the Aston Institute for Forensic Linguistics. This dataset should not be used for any other research purposes without permission from the Aston Institute for Forensic Linguistics.

Baselines [code]

We provide the following baseline methods:

  • A simple method that calculates the cosine similarity between TFIDF-normalized, bag-of-character-tetragram representations of the two texts in a pair. The resulting scores are then shifted using a simple grid search to arrive at optimal performance on the calibration data (see the sketch below).
  • A simple method based on text compression that given a pair of texts calculates the cross-entropy of text2 using the Prediction by Partial Matching model of text1 and vice-versa. Then, the mean and absolute difference of the two cross-entropies are used by a logistic regression model to estimate a verification score in [0,1].

Note that the above baseline methods do not make use of the available discourse type information: participants are highly encouraged to exploit this useful metadata in their submissions.
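As an illustration of the first (character tetragram) baseline, a minimal sketch in Python. This is a rough reimplementation for clarity, not the official baseline code; the grid-search score shifting is omitted:

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

pairs = read_jsonl("pairs.jsonl")

# TF-IDF-weighted character 4-gram representations of all texts.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(4, 4))
vectorizer.fit([text for p in pairs for text in p["pair"]])

with open("answers.jsonl", "w", encoding="utf-8") as out:
    for p in pairs:
        vecs = vectorizer.transform(p["pair"])
        score = float(cosine_similarity(vecs[0], vecs[1])[0, 0])  # similarity in [0, 1]
        out.write(json.dumps({"id": p["id"], "value": score}) + "\n")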

Evaluation [code]

Systems will be compared and ranked on the basis of five complementary metrics:

  • AUC: the conventional area-under-the-ROC-curve score, as implemented in scikit-learn.
  • F1-score: the well-known performance measure (not taking into account non-answers), as implemented in scikit-learn.
  • c@1: a variant of conventional accuracy, which rewards systems that leave difficult problems unanswered (i.e. scores of exactly 0.5), introduced by Peñas and Rodrigo (2011).
  • F_0.5u: a measure that puts more emphasis on deciding same-author cases correctly (Bevendorff et al. 2019)
  • Brier: the complement of the well-known Brier score, for evaluating the goodness of (binary) probabilistic classifiers, as implemented in scikit-learn.

These measures are complementary and provide a holistic way to assess a system's performance: c@1 measures the accuracy of binary predictions but also the ability of systems to leave difficult cases unanswered; AUC measures the ability of systems to assign higher scores to positive cases than to negative cases; Brier measures the ability of systems to calibrate the verification scores as probabilities of a positive answer; F1 and F_0.5u focus on the correct identification of same-author cases.
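For concreteness, a minimal sketch of c@1 (following Peñas and Rodrigo, 2011); the reference evaluation script below remains the authoritative implementation:

# c@1: each non-answer (score of exactly 0.5) is credited with the overall
# accuracy n_correct / n instead of counting as an error.
def c_at_1(true_labels, scores):
    n = len(true_labels)
    n_correct = sum(1 for t, s in zip(true_labels, scores)
                    if s != 0.5 and (s > 0.5) == bool(t))
    n_unanswered = sum(1 for s in scores if s == 0.5)
    return (n_correct + n_unanswered * n_correct / n) / n

print(c_at_1([1, 0, 1, 1], [0.9, 0.2, 0.5, 0.4]))  # 0.625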

A reference evaluation script is made available [code]. Systems will be applied to a pairs-file and are expected to produce a single jsonl-file as a result. In this prediction file, each separate line should contain a valid json-string that provides the ID of a pair in the pairs-file and a "value" field, with a floating-point score that is bounded (0 <= score <= 1), indicating the probability that this pair is a same-author text pair. Systems are allowed to leave some problems unanswered: in such cases, the answer can be left out of the prediction file OR its value should be set to exactly 0.5. All answers that have a value of 0.5 will be considered non-decisions, which will affect some (but not all) evaluation metrics.

Submission

Each participating team is allowed to submit at most three runs. The participants should indicate in their notebook paper the system settings used for each submitted run.

Submissions are not expected to be trained on TIRA (i.e. there will be no new calibration dataset for the testing phase). The submissions should therefore contain already fully calibrated models that should only be deployed on TIRA for the actual testing.

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the EVALUATION-DIRECTORY and (ii) an absolute path to an OUTPUT-DIRECTORY:

 > mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY

Within EVALUATION-DIRECTORY, a single jsonl-formatted file will be included (pairs.jsonl), containing the text pairs (analogously to the calibration data that was released). The answers should be written to a jsonl-formatted file (answers.jsonl) under OUTPUT-DIRECTORY. Each line should contain a single json-formatted answer, using the following syntax:

{"id": "c04fdf1e-ddf5-5542-96e7-13ce18cae176", "value": 0.4921}
                    {"id": "49dc4cae-3d32-5b4d-b240-a080a1dbb659", "value": 0.5}
                    {"id": "f326fe7c-fc10-566f-a70f-0f36e3f92399", "value": 0.5}
                    {"id": "16daa0d1-61b8-5650-b7ee-5e265bd40910", "value": 0.9333}
                    {"id": "08b536a8-4fed-5f62-97bb-e57f79e841d2", "value": 0.0751}
...

Note: Each verification problem should be solved independently of other problems in the collection.
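A minimal command-line skeleton matching this interface might look as follows (a hypothetical Python sketch; predict_pair is a placeholder for your calibrated model):

import argparse, json, os

def predict_pair(text1, text2, discourse_types):
    # Placeholder: replace with your calibrated verification model.
    return 0.5

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", required=True, help="EVALUATION-DIRECTORY containing pairs.jsonl")
    parser.add_argument("-o", required=True, help="OUTPUT-DIRECTORY for answers.jsonl")
    args = parser.parse_args()

    os.makedirs(args.o, exist_ok=True)
    with open(os.path.join(args.i, "pairs.jsonl"), encoding="utf-8") as fin, \
         open(os.path.join(args.o, "answers.jsonl"), "w", encoding="utf-8") as fout:
        for line in fin:
            pair = json.loads(line)
            text1, text2 = pair["pair"]
            score = predict_pair(text1, text2, pair["discourse_type"])
            fout.write(json.dumps({"id": pair["id"], "value": score}) + "\n")

if __name__ == "__main__":
    main()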

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide linked above.

Once deployed in your virtual machine, we ask you to access TIRA, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

  • Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Janek Bevendorff, Martin Potthast, and Benno Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2021. Working notes of CLEF 2021 - Conference and Labs of the Evaluation Forum.
  • Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast and Benno Stein, Overview of the Cross-Domain Authorship Verification Task at PAN 2020. Working notes of CLEF 2020 - Conference and Labs of the Evaluation Forum.
  • Efstathios Stamatatos. Authorship Verification: A Review of Recent Advances. Research in Computing Science, 123, pp. 9-25, 2016.
  • Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, 2008.
  • Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, 2009.
  • Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, 2009.
  • A. Peñas and A. Rodrigo. A Simple Measure to Assess Nonresponse. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pages 1415-1424, 2011.
  • Bevendorff et al. Generalizing Unmasking for Short Texts, Proceedings of NAACL, pages 654-659, 2019.
  • Kredens, K., Heini, A. and Pezik, P. 100 Idiolects - a corpus for research on individual variation across discourse types. Aston University: FoLD Repository, 2021.

Task Committee