Author Identification
2017

This task is divided into author clustering and style breach detection. You can choose to solve one or both of them.

Author Clustering

Authorship attribution is an important problem in information retrieval and computational linguistics, but also in applied areas such as law and journalism, where knowing the author of a document (such as a ransom note) may save lives. The most common framework for testing candidate algorithms is the closed-set attribution task: given known sample documents from a small, finite set of candidate authors, which of them wrote a questioned document of unknown authorship? It has been commented, however, that this may be an unreasonably easy task. A more demanding task is author clustering: given a document collection, group the documents written by the same author so that each cluster corresponds to a different author. This task can also be viewed as establishing authorship links between documents.

Note that the task of authorship verification studied in detail in previous editions of PAN (2013-2015) is strongly associated with author clustering since any clustering problem can be decomposed into a series of author verification problems. We encourage participants to attempt to modify authorship verification approaches to deal with the author clustering task.

PAN-2016 first studied the author clustering task, focusing on relatively long documents such as articles and reviews. In this edition of PAN, we focus on short documents of paragraph length. The aim is to group paragraphs by author, where the paragraphs may have been extracted from the same document or from different documents. Such a task is closely related to author diarization and intrinsic plagiarism detection.

Similar to the PAN-2016 edition of the task, two application scenarios are examined:

  • Complete author clustering: This scenario requires a detailed analysis where the number (k) of different authors found in the collection should be identified and each document should be assigned to exactly one of the k clusters (each cluster corresponds to a different author). In the following illustrative example, 4 different authors are found and the colour of each document indicates its author.
  • Authorship-link ranking: This scenario views the exploration of the given document collection as a retrieval task. It aims at establishing authorship links between documents and provides a list of document pairs ranked according to a confidence score (the score indicates how likely it is that the two documents are by the same author). In the following example, 4 document pairs by the same author are found and these authorship links are then ranked according to their similarity.
Task
Given a collection of (up to 50) short documents (paragraphs extracted from larger documents), identify authorship links and groups of documents by the same author. All documents are single-authored, in the same language, and belong to the same genre. However, the topic and length of the documents may vary. The number of distinct authors whose documents are included in the collection is not given.
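
To make the task concrete, here is a minimal baseline sketch in Python (an illustration only, not a reference implementation): it represents each document by character 3-gram TF-IDF vectors, ranks document pairs by cosine similarity, and links pairs above a fixed threshold. The n-gram size and the threshold value are arbitrary assumptions.

# Minimal baseline sketch (assumptions: character 3-grams, link threshold 0.5).
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def cluster_documents(texts, threshold=0.5):
    """Return (clusters, ranked_links) for a list of document strings."""
    vectors = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(texts)
    sims = cosine_similarity(vectors)

    # Authorship-link ranking: all document pairs, sorted by similarity.
    links = sorted(
        ((sims[i, j], i, j) for i, j in combinations(range(len(texts)), 2)),
        reverse=True,
    )

    # Complete author clustering: union-find over pairs above the threshold.
    parent = list(range(len(texts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for score, i, j in links:
        if score >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values()), links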
Training Phase

To develop your software, we provide you with a training corpus that comprises a set of author clustering problems in 3 languages (English, Dutch, and Greek) and 2 genres (newspaper articles and reviews). Each problem consists of a set of documents in the same language and genre. However, their topic may differ and the document lengths vary from a few hundred to a few thousand words.

The documents of each problem are located in a separate folder. The file info.json provides all required information for each clustering problem: the language (either "en", "nl", or "gr" for English, Dutch, and Greek, respectively), the genre (either "articles" or "reviews"), and the folder (relative path) of each problem.

[
   {"language": "en", "genre": "articles", "folder": "problem001"},
   ...
]
	

The ground truth data of the training corpus consists of two files for each clustering problem: clustering.json and ranking.json similar to the files described in the Output section (see details below). All ground truth files are located in the truth folder.
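
For illustration, a minimal sketch for reading such a training corpus could look as follows (assuming the documents of each problem are stored as plain-text files inside the problem folder):

# Minimal corpus-reading sketch; the *.txt layout inside each problem folder
# is an assumption based on the description above.
import json
import os
from glob import glob


def load_problems(corpus_dir):
    with open(os.path.join(corpus_dir, "info.json"), encoding="utf-8") as f:
        problems = json.load(f)

    for problem in problems:
        folder = os.path.join(corpus_dir, problem["folder"])
        documents = {}
        for path in sorted(glob(os.path.join(folder, "*.txt"))):
            with open(path, encoding="utf-8") as doc:
                documents[os.path.basename(path)] = doc.read()
        yield problem["language"], problem["genre"], problem["folder"], documents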

Download corpus (Updated: Feb 17, 2017)

Evaluation Phase
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the evaluation corpus, including the ground truth data, will become available. This way, you will have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
Output

Your system should produce two output files in JSON:

  • One output file named clustering.json containing complete information about the detected clusters. Each cluster should contain all documents found in the collection that were written by a specific author. A JSON file of the following format should be produced (a list of clusters, where each cluster is a list of documents):
    [
        [
            {"document": "filename1"},
            {"document": "filename2"},
            …
        ],
        …
    ]
    

    The clusters should be non-overlapping, thus each filename should belong to exactly one cluster.

  • One output file named ranking.json containing a list of document pairs ranked according to a real-valued score in [0,1], where higher values denote higher confidence that the two documents are by the same author. A JSON file of the following format should be produced (a list of document pairs, each with a real-valued score):
    [
        {"document1": "filename1",
         "document2": "filename2",
         "score": real-valued-number},
        …
    ]
    

    The order of documents within a pair is not important (e.g. "document1": "filename1", "document2": "filename2" is the same as "document2": "filename1", "document1": "filename2"). If the same pair is reported more than once, only the first occurrence will be taken into account.

An illustrative example follows. Let’s assume that a document collection of 6 files is given: file1.txt, file2.txt, file3.txt, file4.txt, file5.txt, and file6.txt. There are 3 clusters: (i) file1.txt, file3.txt, and file4.txt are by the same author, (ii) file5.txt and file6.txt are by another author, and (iii) file2.txt is by yet another author.

  • The output file in JSON for the complete author clustering task should be:
    [
        [
            {"document": "file1.txt"},
            {"document": "file3.txt"},
            {"document": "file4.txt"}
        ],
        [
            {"document": "file5.txt"},
            {"document": "file6.txt"}
        ],
        [
            {"document": "file2.txt"}
        ]
    ]
    
  • An example of the output file for authorship-link ranking could be:
    [
        {"document1": "file1.txt",
         "document2": "file4.txt",
         "score": 0.95},
        {"document1": "file3.txt",
         "document2": "file4.txt",
         "score": 0.75},
        {"document1": "file5.txt",
         "document2": "file6.txt",
         "score": 0.66},
        {"document1": "file1.txt",
         "document2": "file3.txt",
         "score": 0.63}
    ]
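
For concreteness, the following Python sketch serializes exactly this 6-file example into the two required output files using the standard json module:

# Sketch: write clustering.json and ranking.json for the example above.
import json

clusters = [["file1.txt", "file3.txt", "file4.txt"],
            ["file5.txt", "file6.txt"],
            ["file2.txt"]]
links = [("file1.txt", "file4.txt", 0.95),
         ("file3.txt", "file4.txt", 0.75),
         ("file5.txt", "file6.txt", 0.66),
         ("file1.txt", "file3.txt", 0.63)]

with open("clustering.json", "w", encoding="utf-8") as f:
    json.dump([[{"document": name} for name in cluster] for cluster in clusters],
              f, indent=4)

with open("ranking.json", "w", encoding="utf-8") as f:
    json.dump([{"document1": a, "document2": b, "score": score}
               for a, b, score in links], f, indent=4)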
    
Performance Measures
  • The clustering output will be evaluated according to BCubed F-score (Amigo et al. 2007)
  • The ranking of authorship links will be evaluated according to Mean Average Precision (Manning et al. 2008)
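
For intuition only, a simplified BCubed F-score sketch is given below; the official scores are computed by the evaluator script provided further down, so treat this merely as an illustration of the measure.

# Simplified BCubed F-score sketch (illustration only, not the official evaluator).
def bcubed_f(truth, prediction):
    """truth, prediction: dicts mapping document name -> author/cluster label."""
    docs = list(truth)

    def averaged(label_a, label_b):
        # For each document d: the fraction of documents sharing d's label_a
        # that also share d's label_b, averaged over all documents.
        total = 0.0
        for d in docs:
            same_a = [e for e in docs if label_a[e] == label_a[d]]
            matching = sum(1 for e in same_a if label_b[e] == label_b[d])
            total += matching / len(same_a)
        return total / len(docs)

    precision = averaged(prediction, truth)  # purity of the predicted clusters
    recall = averaged(truth, prediction)     # coverage of the true authors
    return 2 * precision * recall / (precision + recall)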

For your convenience, we provide an evaluator script written in Octave.

Download script

It takes three parameters: (-i) an input directory (the data set including a 'truth' folder), (-a) an answers directory (your software output) and (-o) an output directory where the evaluation results are written to. Of course, you are free to modify the script according to your needs.

Submission

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the evaluation corpus and (ii) an absolute path to an empty output directory:

> mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY
	

Within EVALUATION-DIRECTORY, an info.json file and a number of folders, one for each clustering problem, will be found (similar to the training corpus described above). For each clustering problem, a new folder should be created in OUTPUT-DIRECTORY using the same folder name given in info.json for that problem, and the clustering.json and ranking.json output files should be written within that folder (similar to the truth folder of the training corpus).
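
A possible skeleton of such a command-line entry point is sketched below; cluster_and_rank() is a placeholder for your own approach and not part of any provided API.

# Command-line skeleton matching the interface above.
import argparse
import json
import os


def cluster_and_rank(problem_dir):
    # Placeholder: replace with your own clustering and ranking approach.
    return [], []


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", dest="input_dir", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    args = parser.parse_args()

    with open(os.path.join(args.input_dir, "info.json"), encoding="utf-8") as f:
        problems = json.load(f)

    for problem in problems:
        out_folder = os.path.join(args.output_dir, problem["folder"])
        os.makedirs(out_folder, exist_ok=True)
        clustering, ranking = cluster_and_rank(
            os.path.join(args.input_dir, problem["folder"]))
        with open(os.path.join(out_folder, "clustering.json"), "w", encoding="utf-8") as f:
            json.dump(clustering, f, indent=4)
        with open(os.path.join(out_folder, "ranking.json"), "w", encoding="utf-8") as f:
            json.dump(ranking, f, indent=4)


if __name__ == "__main__":
    main()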

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

We refer you to:

Efstathios Stamatatos
University of the Aegean

Task Committee

Walter Daelemans
University of Antwerp

Martin Potthast
Bauhaus-Universität Weimar

Benno Stein
Bauhaus-Universität Weimar

Ben Verhoeven
University of Antwerp

Style Breach Detection

While many approaches target the problem of identifying the authors of whole documents, research on investigating multi-authored documents is sparse. To narrow this gap, the author diarization task of the PAN-2016 edition already focused on collaboratively written documents, attempting to cluster text by author within documents. This year we modify the problem by asking participants to detect style breaches within documents, i.e., to locate the borders where authorship changes.

The problem is therefore related to the text segmentation problem, with the difference that the latter usually focuses on detecting switches of topic or story. In contrast, this task aims to find borders based on writing style, disregarding the specific content. As the goal is only to find the borders, identifying or clustering the authors of the segments is not required. A simple example consisting of four style breaches (switches / borders) is illustrated below:

Task

Given a document, determine whether it is multi-authored, and if so, find the borders where the authors switch.

All documents are provided in English and may contain anywhere from zero to arbitrarily many switches (style breaches). Note that authorship switches may only occur at the end of a sentence, i.e., never within one.

Training Phase

To develop your algorithms, a training data set including corresponding solutions is provided.

Download corpus (Updated: Feb 15, 2017)

For each problem instance X, two files are provided:

  • problem-X.txt contains the actual text
  • problem-X.truth contains the ground truth, i.e., the correct solution in JSON format:
    {
        "borders": [
            character_position_border_1,
            character_position_border_2,
            …
        ]
    }
    

    To identify a border, the absolute character position of the first non-whitespace character of the new segment is used. The document starts at character position 0. An example matching the borders of the image above could look as follows:

    {
        "borders": [1709, 3119, 3956, 5671]
    }
    

    An empty array indicates that the document is single-authored, i.e., contains no style switches:

    {
        "borders": []
    }
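
A small sketch of this border convention is given below; the breach offsets passed in are assumed to be character positions right after the sentences at which your approach predicts a switch.

# Sketch of the border convention: a border is the position of the first
# non-whitespace character of the new segment (document starts at position 0).
import json


def to_borders(text, breach_offsets):
    borders = []
    for offset in breach_offsets:
        pos = offset
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos < len(text):
            borders.append(pos)
    return borders


def write_truth(path, borders):
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"borders": borders}, f)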
    
Evaluation Phase
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the evaluation corpus, including the ground truth data, will become available. This way, you will have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
Output

In general, the data structure during the evaluation phase will be similar to that of the training phase, with the exception that the ground truth files are missing. Thus, for each given problem problem-X.txt, your software should output the missing solution file problem-X.truth. The output syntax should be exactly as described in the training phase section.

Performance Measures

To evaluate the predicted style breaches, two metrics will be used:

  • The WindowDiff metric (Pevzner and Hearst, 2002) was proposed for general text segmentation evaluation and is still the de facto standard for such problems. It gives an error rate (between 0 and 1, with 0 indicating a perfect prediction) for predicting borders by penalizing near-misses less than complete misses or extra borders.
  • A more recent adaptation of the WindowDiff metric is the WinPR metric (Scaiano and Inkpen, 2012). It enhances WindowDiff by computing the common information retrieval measures precision (WinP) and recall (WinR), and thus allows a more detailed, qualitative statement about the prediction. For the final ranking of all participating teams, the F-score of WinPR will be used.

Note that while both metrics will be computed on a word-level, you still have to provide your solutions on a character-level (delegating the tokenization to the evaluator).
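
For intuition, a simplified word-level WindowDiff sketch is given below; the official scores come from the evaluator script provided further down, so this is only meant to convey the idea of the metric.

# Simplified WindowDiff sketch (after Pevzner and Hearst, 2002); illustration only.
def window_diff(ref_boundaries, hyp_boundaries, n_words, k=None):
    """Boundaries are word indices at which a new segment starts."""
    ref, hyp = set(ref_boundaries), set(hyp_boundaries)
    if k is None:
        # Common convention: half the average reference segment length.
        k = max(1, n_words // (2 * (len(ref) + 1)))

    errors = 0
    for i in range(n_words - k):
        ref_count = sum(1 for b in ref if i < b <= i + k)
        hyp_count = sum(1 for b in hyp if i < b <= i + k)
        errors += ref_count != hyp_count
    return errors / (n_words - k)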

For your convenience, we provide the evaluator script written in Python.

Download script

It takes three parameters: an input directory (the data set), an inputRun directory (your computed breaches) and an output directory where the results file is written to. Of course, you are free to modify the script according to your needs.

Submission

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the evaluation corpus and (ii) an absolute path to an empty output directory:

> mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY

Within EVALUATION-DIRECTORY, you will find a list of problem instances, i.e., [filename].txt files. For each problem instance you should produce the solution file [filename].truth in the OUTPUT-DIRECTORY. For instance, you read EVALUATION-DIRECTORY/problem-12.txt, process it, and write your results to OUTPUT-DIRECTORY/problem-12.truth.
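
A possible sketch of this file handling is given below; detect_borders() is a placeholder for your own style breach detector.

# File-handling sketch for the style breach detection submission.
import json
import os
from glob import glob


def detect_borders(text):
    # Placeholder: replace with your own style breach detection.
    return []


def run(input_dir, output_dir):
    for txt_path in sorted(glob(os.path.join(input_dir, "*.txt"))):
        with open(txt_path, encoding="utf-8") as f:
            text = f.read()
        name = os.path.splitext(os.path.basename(txt_path))[0] + ".truth"
        with open(os.path.join(output_dir, name), "w", encoding="utf-8") as f:
            json.dump({"borders": detect_borders(text)}, f)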

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

We refer you to:

Michael Tschuggnall
University of Innsbruck

Task Committee

Günther Specht
University of Innsbruck

Martin Potthast
Bauhaus-Universität Weimar

Benno Stein
Bauhaus-Universität Weimar