Authorship Clustering 2016

Synopsis

Task: Given a document collection, cluster the texts by authorship.
Input: [data]
Output: [example] [schema] [verifier]
Evaluation: [code]
Submission: [submit]

Introduction

Authorship attribution is an important problem in information retrieval and computational linguistics, but also in applied areas such as law and journalism, where knowing the author of a document (such as a ransom note) may help save lives. The most common framework for testing candidate algorithms is the closed-set attribution task: given known sample documents from a small, finite set of candidate authors, which of them wrote a questioned document of unknown authorship? It has been commented, however, that this may be an unreasonably easy task. A more demanding task is author clustering: given a document collection, the task is to group documents written by the same author so that each cluster corresponds to a different author. This task can also be viewed as establishing authorship links between documents.

Note that a variation of author clustering was included in the PAN-2012 edition. However, it focused on the paragraph level and is therefore more closely related to the author diarization task. In PAN-2016, we focus on document-level author clustering. The task of authorship verification, studied in detail in previous editions of PAN (2013-2015), is strongly associated with author clustering, since any clustering problem can be decomposed into a series of author verification problems. We encourage participants to attempt to modify authorship verification approaches to deal with the author clustering task.
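The decomposition of clustering into verification problems can be sketched as follows. This is only an illustration under stated assumptions: `cluster_by_verification` and `toy_verifier` are hypothetical names, the verifier is a placeholder rather than a real authorship verification method, and merging accepted pairs transitively (via union-find) is one possible design choice, not something prescribed by the task.

```python
from itertools import combinations
from typing import Callable, Dict, List


def cluster_by_verification(
    docs: Dict[str, str],
    same_author: Callable[[str, str], bool],
) -> List[List[str]]:
    """Group documents by linking every pair that the verifier accepts."""
    parent = {name: name for name in docs}

    def find(name: str) -> str:
        # Follow parent pointers to the representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # One verification problem per unordered document pair.
    for a, b in combinations(sorted(docs), 2):
        if same_author(docs[a], docs[b]):
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in sorted(docs):
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def toy_verifier(text_a: str, text_b: str) -> bool:
    # Placeholder only: compares the first word of each text.
    return text_a.split()[0] == text_b.split()[0]


demo_docs = {"d1.txt": "alpha one", "d2.txt": "beta two", "d3.txt": "alpha three"}
demo_clusters = cluster_by_verification(demo_docs, toy_verifier)
```

Because links are merged transitively, a single wrong verification decision can join two otherwise separate clusters, so a real system would likely need a stricter merging rule or a calibrated threshold.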

In this edition of PAN we aim to study two application scenarios:

Complete author clustering: This scenario requires a detailed analysis where the number (k) of different authors found in the collection should be identified and each document should be assigned to exactly one of the k clusters (each cluster corresponds to a different author). In the following illustrative example, 4 different authors are found and the colour of each document indicates its author.

Authorship-link ranking: This scenario views the exploration of the given document collection as a retrieval task. It aims at establishing authorship links between documents and provides a list of document pairs ranked according to a confidence score (the score indicates how likely it is that both documents of the pair are by the same author). In the following example, 4 document pairs with shared authorship are found, and these authorship links are then sorted according to their similarity.

Task

Given a collection of (up to 100) documents, identify authorship links and groups of documents by the same author. All documents are single-authored, in the same language, and belong to the same genre. However, the topic and text length of documents may vary. The number of distinct authors whose documents are included in the collection is not given.

Development Phase

To develop your software, we provide you with a training corpus that comprises a set of author clustering problems in 3 languages (English, Dutch, and Greek) and 2 genres (newspaper articles and reviews). Each problem consists of a set of documents in the same language and genre. However, their topic may differ and the document lengths vary from a few hundred to a few thousand words.

The documents of each problem are located in a separate folder. The file info.json provides all required information for each clustering problem: the language (either "en", "nl", or "gr" for English, Dutch, and Greek, respectively), the genre (either "articles" or "reviews"), and the folder of each problem (relative path).

[
   {"language": "en", "genre": "articles", "folder": "problem001"},
   ...
]
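Reading the corpus based on info.json might look like the sketch below. The function name `iter_problems` is hypothetical, and two things are assumed rather than stated by the task description: that all files are UTF-8 encoded and that each problem folder contains only the document files.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_problems(corpus_dir: str) -> Iterator[Tuple[dict, Dict[str, str]]]:
    """Yield (problem description, {filename: text}) for every problem."""
    root = Path(corpus_dir)
    with open(root / "info.json", encoding="utf-8") as fh:
        problems = json.load(fh)

    for problem in problems:
        # "folder" is a path relative to the corpus directory.
        folder = root / problem["folder"]
        texts = {
            path.name: path.read_text(encoding="utf-8")  # encoding is an assumption
            for path in sorted(folder.iterdir())
            if path.is_file()
        }
        yield problem, texts
```

Each yielded description carries the "language" and "genre" fields, so language-specific or genre-specific settings can be chosen per problem.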

The ground truth data of the training corpus consists of two files for each clustering problem, clustering.json and ranking.json, which are similar to the files described in the Output section (see details below). All ground truth files are located in the truth folder.

Evaluation Phase

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the evaluation corpus will become available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Output

Your system should produce two output files in JSON:

One output file named clustering.json including complete information about the detected clusters. Each cluster should contain all documents found in the collection by a specific author. A JSON file of the following format should be produced (a list of clusters, each cluster is a list of documents):
```
[
	[
		{"document":  "filename1"},
		{"document":  "filename2"},
	…
	],
…
]
```
The clusters should be non-overlapping, thus each filename should belong to exactly one cluster.
One output file named ranking.json including a list of document pairs ranked according to a real-valued score in [0,1], where higher values denote more confidence that the pair of documents are by the same author. A JSON file of the following format should be produced (a list of document pairs and a real-valued number):

[
	{"document1": "filename1",
	 "document2": "filename2",
	 "score": real-valued-number},
	…
]

The order of documents within a pair is not important (e.g. "document1": "filename1", "document2": "filename2" is the same as "document2": "filename1", "document1": "filename2"). If the same pair is reported more than once, only the first occurrence is taken into account.
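The two rules above (pairs are unordered, and only the first occurrence of a pair counts) can be mirrored on the participant side before writing the file. The following is a minimal sketch; `deduplicate_links` is a hypothetical helper name, and the official evaluator may implement the rule differently.

```python
from typing import List


def deduplicate_links(links: List[dict]) -> List[dict]:
    """Keep only the first occurrence of every unordered document pair."""
    seen = set()
    unique = []
    for link in links:
        # A frozenset ignores which document is listed first.
        key = frozenset((link["document1"], link["document2"]))
        if key in seen:
            continue
        seen.add(key)
        unique.append(link)
    return unique
```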

An illustrative example follows. Let’s assume that a document collection of 6 files is given: file1.txt, file2.txt, file3.txt, file4.txt, file5.txt, and file6.txt. There are 3 clusters: (i) file1.txt, file3.txt, and file4.txt are by the same author, (ii) file5.txt and file6.txt are by another author, and (iii) file2.txt is by yet another author.

The output file in JSON for the complete author clustering task should be:

[
	[
		{"document": "file1.txt"},
		{"document": "file3.txt"},
		{"document": "file4.txt"}
	],
	[
		{"document": "file5.txt"},
		{"document": "file6.txt"}
	],
	[
		{"document": "file2.txt"}
	]
]
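A clustering held in memory as lists of filenames can be serialized to this format along the following lines. This is a sketch only: `clustering_to_json` is a hypothetical name, and the duplicate check simply enforces the non-overlap requirement stated above.

```python
import json
from typing import List


def clustering_to_json(clusters: List[List[str]]) -> str:
    """Serialize clusters of filenames into the clustering.json layout."""
    seen = set()
    for cluster in clusters:
        for name in cluster:
            # Clusters must not overlap: each filename may appear only once.
            if name in seen:
                raise ValueError(f"{name} is assigned to more than one cluster")
            seen.add(name)

    payload = [[{"document": name} for name in cluster] for cluster in clusters]
    return json.dumps(payload, indent=4)


example_clustering = clustering_to_json(
    [["file1.txt", "file3.txt", "file4.txt"], ["file5.txt", "file6.txt"], ["file2.txt"]]
)
```

Writing `example_clustering` to a file named clustering.json reproduces the structure of the example above.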

An example of the output file for authorship-link ranking could be:

[	{"document1": "file1.txt",
	 "document2": "file4.txt",
	 "score": 0.95},

	{"document1": "file3.txt",
	 "document2": "file4.txt",
	 "score": 0.75},

	{"document1": "file5.txt",
	 "document2": "file6.txt",
	 "score": 0.66},

	{"document1": "file1.txt",
	 "document2": "file3.txt",
	 "score": 0.63}
]
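The ranking file can be produced in a similar way from pairwise scores. In this sketch, `ranking_to_json` is a hypothetical name, the scores are assumed to be available as a dictionary keyed by document pairs, and sorting the links by descending score is an assumption about how "ranked" should be reflected in the file.

```python
import json
from typing import Dict, Tuple


def ranking_to_json(scores: Dict[Tuple[str, str], float]) -> str:
    """Serialize scored document pairs into the ranking.json layout."""
    links = []
    for (doc_a, doc_b), score in scores.items():
        # Scores are expected to lie in [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {doc_a}/{doc_b} is outside [0, 1]: {score}")
        links.append({"document1": doc_a, "document2": doc_b, "score": score})

    # Most confident links first; list.sort is stable, so ties keep input order.
    links.sort(key=lambda link: link["score"], reverse=True)
    return json.dumps(links, indent=4)


example_ranking = ranking_to_json(
    {
        ("file5.txt", "file6.txt"): 0.66,
        ("file1.txt", "file4.txt"): 0.95,
        ("file1.txt", "file3.txt"): 0.63,
        ("file3.txt", "file4.txt"): 0.75,
    }
)
```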

Performance Measures

The clustering output will be evaluated according to BCubed F-score (Amigó et al. 2007).
The ranking of authorship links will be evaluated according to Mean Average Precision (Manning et al. 2008).
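For orientation, BCubed precision and recall are commonly computed per document: precision is the fraction of documents in the predicted cluster that share the document's true author, and recall is the fraction of documents by the true author that appear in the predicted cluster. The sketch below averages both over all documents and combines them with the harmonic mean. It is an assumption that this matches the official evaluator in every detail; `bcubed_f_score` is a hypothetical name.

```python
from typing import Dict, List, Set


def bcubed_f_score(truth: List[List[str]], predicted: List[List[str]]) -> float:
    """BCubed F-score of a predicted clustering against the true clustering."""

    def membership(clusters: List[List[str]]) -> Dict[str, Set[str]]:
        # Map every document to the set of documents in its cluster.
        return {doc: set(cluster) for cluster in clusters for doc in cluster}

    true_of = membership(truth)
    pred_of = membership(predicted)
    if not true_of or set(true_of) != set(pred_of):
        raise ValueError("both clusterings must cover the same non-empty document set")

    precision = recall = 0.0
    for doc in true_of:
        overlap = len(true_of[doc] & pred_of[doc])
        precision += overlap / len(pred_of[doc])
        recall += overlap / len(true_of[doc])
    precision /= len(true_of)
    recall /= len(true_of)

    # Every document overlaps with itself, so precision + recall is never zero.
    return 2 * precision * recall / (precision + recall)


truth_example = [["file1.txt", "file3.txt", "file4.txt"], ["file5.txt", "file6.txt"], ["file2.txt"]]
```

With this definition, placing every document in its own cluster yields perfect precision but low recall, while a single large cluster yields the opposite.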

For your convenience, we provide an evaluator script written in Octave.

It takes three parameters: (-i) an input directory (the data set including a 'truth' folder), (-a) an answers directory (your software output) and (-o) an output directory where the evaluation results are written to. Of course, you are free to modify the script according to your needs.
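The Mean Average Precision measure for authorship-link ranking can be sketched in the same spirit. The true links are all document pairs that share a cluster in the ground truth; average precision sums the precision at each rank where a true link appears and divides by the number of true links. The function names are hypothetical, and the handling of problems without any true link (scored 0.0 here) is an assumption, not taken from the official evaluator.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set, Tuple


def true_links(truth: List[List[str]]) -> Set[FrozenSet[str]]:
    """All unordered document pairs that share a ground-truth cluster."""
    return {frozenset(pair) for cluster in truth for pair in combinations(cluster, 2)}


def average_precision(
    ranked_pairs: Sequence[Tuple[str, str]], links: Set[FrozenSet[str]]
) -> float:
    """Average precision of a ranked list of pairs (best pair first)."""
    if not links:
        return 0.0  # assumption: no true links means nothing can be retrieved
    seen = set()
    hits = 0
    total = 0.0
    for pair in ranked_pairs:
        key = frozenset(pair)
        if key in seen:
            continue  # only the first occurrence of a pair counts
        seen.add(key)
        if key in links:
            hits += 1
            total += hits / len(seen)  # precision at the current rank
    return total / len(links)


def mean_average_precision(
    problems: Sequence[Tuple[Sequence[Tuple[str, str]], Set[FrozenSet[str]]]]
) -> float:
    """Mean of the average precision over several clustering problems."""
    scores = [average_precision(ranked, links) for ranked, links in problems]
    return sum(scores) / len(scores)
```

True links that never appear in the ranking contribute nothing to the sum but still count in the denominator, so omitting them lowers the score.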