Author Identification
2016


This task is divided into author clustering and author diarization. You can choose to solve one or both of them.

Author Clustering

Authorship attribution is an important problem in information retrieval and computational linguistics, but also in applied areas such as law and journalism, where knowing the author of a document (such as a ransom note) may save lives. The most common framework for testing candidate algorithms is the closed-set attribution task: given known sample documents from a small, finite set of candidate authors, which of them wrote a questioned document of unknown authorship? It has been commented, however, that this may be an unreasonably easy task. A more demanding task is author clustering, where, given a document collection, the task is to group documents written by the same author so that each cluster corresponds to a different author. This task can also be viewed as establishing authorship links between documents.

Note that a variation of author clustering was included in the PAN-2012 edition. However, it focused on the paragraph level and is therefore more closely related to the author diarization task (described below). In PAN-2016, we focus on document-level author clustering. The task of authorship verification, studied in detail in previous editions of PAN (2013-2015), is strongly associated with author clustering, since any clustering problem can be decomposed into a series of authorship verification problems. We encourage participants to adapt authorship verification approaches to the author clustering task; one possible decomposition is sketched below.
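To make this connection concrete, here is a minimal sketch, under our own assumptions rather than any official baseline, of how pairwise verification decisions could be turned into clusters: score every document pair with a verification function, keep the pairs above a threshold, and take the connected components of the resulting link graph as clusters. Both verification_score and the threshold are placeholders for whatever approach you develop.

import itertools

def clusters_from_pairwise(documents, verification_score, threshold=0.5):
    """Group documents by linking every pair whose (hypothetical) verification
    score reaches the threshold and taking connected components."""
    links = [(a, b) for a, b in itertools.combinations(documents, 2)
             if verification_score(a, b) >= threshold]

    # union-find over documents to obtain the connected components
    parent = {d: d for d in documents}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for d in documents:
        clusters.setdefault(find(d), []).append(d)
    return list(clusters.values())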

In this edition of PAN we aim to study two application scenarios:

  • Complete author clustering: This scenario requires a detailed analysis where the number (k) of different authors found in the collection should be identified and each document should be assigned to exactly one of the k clusters (each cluster corresponds to a different author). In the illustrative example, 4 different authors are found and the colour of each document indicates its author.
  • Authorship-link ranking: This scenario views the exploration of the given document collection as a retrieval task. It aims to establish authorship links between documents and to provide a list of document pairs ranked according to a confidence score (the score indicates how likely it is that the two documents are by the same author). In the illustrative example, 4 document pairs sharing authorship are found, and these authorship links are then sorted according to their similarity.
Task
Given a collection of (up to 100) documents, identify authorship links and groups of documents by the same author. All documents are single-authored, in the same language, and belong to the same genre. However, the topic or text-length of documents may vary. The number of distinct authors whose documents are included in the collection is not given.
Training Phase

To develop your software, we provide you with a training corpus that comprises a set of author clustering problems in 3 languages (English, Dutch, and Greek) and 2 genres (newspaper articles and reviews). Each problem consists of a set of documents in the same language and genre. However, their topics may differ, and the document lengths vary from a few hundred to a few thousand words.

The documents of each problem are located in a separate folder. The file info.json lists the required information for each clustering problem: the language (either "en", "nl", or "gr" for English, Dutch, and Greek, respectively), the genre (either "articles" or "reviews"), and the folder of each problem (relative path).

[
   {"language": "en", "genre": "articles", "folder": "problem001"},
   ...
]
	

The ground truth data of the training corpus consists of two files for each clustering problem, clustering.json and ranking.json, similar to the files described in the Output section (see details below). All ground truth files are located in the truth folder.

Download corpus

Evaluation Phase
Once you have finished tuning your approach to achieve satisfying performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the evaluation corpus, including the ground truth data, will become available. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
Output

Your system should produce two output files in JSON:

  • One output file, named clustering.json, containing complete information about the detected clusters. Each cluster should contain all documents found in the collection by a specific author. A JSON file of the following format should be produced (a list of clusters, each cluster being a list of documents):
  • [
    	[
    		{"document":  "filename1"},
    		{"document":  "filename2"},
    	…
    	],
    …
    ]
    

    The clusters should be non-overlapping, thus each filename should belong to exactly one cluster.

  • One output file, named ranking.json, containing a list of document pairs ranked according to a real-valued score in [0,1], where higher values denote higher confidence that the pair of documents is by the same author. A JSON file of the following format should be produced (a list of document pairs, each with a real-valued score):
  • [
    	{"document1": "filename1",
    	 "document2": "filename2",
    	 "score": real-valued-number},
    	…
    ]
    

    The order of documents within a pair is not important (e.g. "document1": "filename1", "document2": "filename2" is the same as "document2": "filename1", "document1": "filename2"). In case the same pair is reported more than once, only the first occurrence will be taken into account.

An illustrative example follows; a minimal sketch for producing these two files is given after the example. Let’s assume that a document collection of 6 files is given: file1.txt, file2.txt, file3.txt, file4.txt, file5.txt, and file6.txt. There are 3 clusters: (i) file1.txt, file3.txt, and file4.txt are by the same author, (ii) file5.txt and file6.txt are by another author, and (iii) file2.txt is by yet another author.

  • The output file in JSON for the complete author clustering task should be:
  • [   [	{"document": "file1.txt"},
    		{"document": "file3.txt"},
    		{"document": "file4.txt"}	],
    	[
    		{"document": "file5.txt"},
    		{"document": "file6.txt"}	],
    	[
    		{"document": "file2.txt"}	]
    ]
    
  • An example of the output file for authorship-link ranking could be:
  • [	{"document1": "file1.txt",
    	 "document2": "file4.txt",
    	 "score": 0.95},
    
    	{"document1": "file3.txt",
    	 "document2": "file4.txt",
    	 "score": 0.75},
    
    	{"document1": "file5.txt",
    	 "document2": "file6.txt",
    	 "score": 0.66},
    
    	{"document1": "file1.txt",
    	 "document2": "file3.txt",
    	 "score": 0.63}
    ]
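The following is a minimal sketch, under our own naming assumptions (the helper write_outputs and the variables clusters and scored_pairs are illustrative only), of how such clustering.json and ranking.json files could be produced from in-memory results:

import json
import os

def write_outputs(output_folder, clusters, scored_pairs):
    """clusters: list of lists of filenames;
    scored_pairs: list of (filename1, filename2, score) tuples."""
    os.makedirs(output_folder, exist_ok=True)

    # clustering.json: a list of clusters, each a list of documents
    clustering = [[{"document": doc} for doc in cluster] for cluster in clusters]
    with open(os.path.join(output_folder, "clustering.json"), "w") as f:
        json.dump(clustering, f, indent=4)

    # ranking.json: pairs sorted by descending score; duplicate pairs
    # (regardless of document order) are dropped, keeping the first occurrence
    seen, ranking = set(), []
    for d1, d2, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        key = frozenset((d1, d2))
        if key in seen:
            continue
        seen.add(key)
        ranking.append({"document1": d1, "document2": d2, "score": score})
    with open(os.path.join(output_folder, "ranking.json"), "w") as f:
        json.dump(ranking, f, indent=4)

For the 6-file example above, calling write_outputs("problem001", [["file1.txt", "file3.txt", "file4.txt"], ["file5.txt", "file6.txt"], ["file2.txt"]], [("file1.txt", "file4.txt", 0.95), ("file3.txt", "file4.txt", 0.75), ("file5.txt", "file6.txt", 0.66), ("file1.txt", "file3.txt", 0.63)]) would reproduce the two files shown (the folder name problem001 is illustrative).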
    
Performance Measures
  • The clustering output will be evaluated according to the BCubed F-score (Amigo et al. 2007)
  • The ranking of authorship links will be evaluated according to Mean Average Precision (Manning et al. 2008). A simplified sketch of both measures follows this list.
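The official evaluation scripts are provided by the organizers; the following is only a simplified sketch for self-checking. It assumes clusterings are represented as document-to-cluster-id mappings, that true authorship links are given as a set of document pairs, and that Mean Average Precision is the mean, over all clustering problems, of the average precision computed per problem.

def bcubed_fscore(truth, prediction):
    """truth, prediction: dicts mapping each document to a cluster id.
    Returns the BCubed F-score (harmonic mean of BCubed precision and recall)."""
    docs = list(truth)
    precision = recall = 0.0
    for d in docs:
        same_pred = [e for e in docs if prediction[e] == prediction[d]]
        same_true = [e for e in docs if truth[e] == truth[d]]
        precision += sum(truth[e] == truth[d] for e in same_pred) / len(same_pred)
        recall += sum(prediction[e] == prediction[d] for e in same_true) / len(same_true)
    precision, recall = precision / len(docs), recall / len(docs)
    return 2 * precision * recall / (precision + recall)

def average_precision(ranked_pairs, true_links):
    """ranked_pairs: (doc1, doc2) pairs sorted by descending score;
    true_links: set of frozensets holding the true same-author pairs."""
    hits, ap = 0, 0.0
    for k, (d1, d2) in enumerate(ranked_pairs, start=1):
        if frozenset((d1, d2)) in true_links:
            hits += 1
            ap += hits / k
    return ap / len(true_links) if true_links else 0.0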
Submission

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the evaluation corpus and (ii) an absolute path to an empty output directory:

> mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY
	

Within EVALUATION-DIRECTORY, an info.json file and a number of folders, one for each clustering problem, will be found (similar to the training corpus described above). For each clustering problem, a new folder should be created in OUTPUT-DIRECTORY using the same folder name given in info.json for that problem, and the clustering.json and ranking.json output files should be written into that folder (similar to the truth folder of the training corpus).
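Putting the pieces together, a minimal command-line skeleton could look as follows. It only assumes the info.json layout described above; solve_problem is a trivial placeholder baseline (every document in its own cluster, no authorship links) to be replaced by your own approach.

import argparse
import json
import os

def solve_problem(problem_folder, documents, language, genre):
    """Placeholder baseline: each document forms its own cluster and no
    authorship links are reported. Replace with your own approach."""
    return [[doc] for doc in documents], []

def main():
    # command-line interface as required: -i evaluation directory, -o output directory
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", dest="input_dir", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    args = parser.parse_args()

    with open(os.path.join(args.input_dir, "info.json")) as f:
        problems = json.load(f)

    for problem in problems:
        in_folder = os.path.join(args.input_dir, problem["folder"])
        out_folder = os.path.join(args.output_dir, problem["folder"])
        os.makedirs(out_folder, exist_ok=True)

        documents = sorted(os.listdir(in_folder))
        clusters, scored_pairs = solve_problem(in_folder, documents,
                                               problem["language"], problem["genre"])

        with open(os.path.join(out_folder, "clustering.json"), "w") as f:
            json.dump([[{"document": d} for d in c] for c in clusters], f, indent=4)
        with open(os.path.join(out_folder, "ranking.json"), "w") as f:
            json.dump([{"document1": a, "document2": b, "score": s}
                       for a, b, s in scored_pairs], f, indent=4)

if __name__ == "__main__":
    main()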

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once your software is deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

We refer you to:

Efstathios Stamatatos

University of the Aegean

Task Committee

Walter Daelemans

University of Antwerp

Patrick Juola

Duquesne University

Martin Potthast

Bauhaus-Universität Weimar

Benno Stein

Bauhaus-Universität Weimar

Ben Verhoeven

University of Antwerp

Author Diarization or Intrinsic Plagiarism Detection

The term author diarization originates from the research field of speaker diarization, where approaches try to automatically identify and cluster the different speakers of an audio speech signal, such as a telephone conversation or a political TV debate, by processing and analyzing the audio frequency signal.

Similar to such approaches, the task of author diarization in this PAN edition is to identify different authors within a single document. Such documents may be the result of collaborative work (e.g., a combined master thesis written by two students, a scientific paper written by four authors, …), or the result of plagiarism. The latter is a special case, where it can be assumed that the main text is written by one author and only some fragments are by other writers (the plagiarized or intrusive sections). In contrast, the contributions to a collaboratively written document may be equally weighted, i.e., each author contributes to the same extent.

Task

Given a document, identify and group text fragments that correspond to individual authors. As in speaker diarization, where the active speaker may change at any time, you cannot assume that changes in authorship occur only at paragraph boundaries; rather, you should be prepared to detect a change of author at any text position. An example could be as follows:

"She is also a successful businesswoman and an American icon, was born in Jersey City to middle-class Polish-American parents and she earned a partial scholarship to …"

Nevertheless, you may use paragraph boundaries or other useful signals as heuristics for potential changes; a minimal segmentation sketch along these lines follows.
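For instance, one simple way to exploit such a heuristic is to segment the document at blank lines while keeping track of character offsets, so that any grouping of segments can later be reported as "from"/"to" positions (see the Output section). This is only an illustrative preprocessing sketch, not a required step.

import re

def paragraph_segments(text):
    """Return (start, end) character offsets of the paragraphs of a document,
    where paragraphs are separated by blank lines. Offsets refer to the original
    text and start at position 0; end is exclusive (Python slicing style), so a
    segment would be reported as {"from": start, "to": end - 1} if, as the
    ground-truth examples below suggest, "to" positions are inclusive."""
    segments, start = [], 0
    for separator in re.finditer(r"\n\s*\n", text):
        end = separator.start()
        if text[start:end].strip():      # skip empty chunks
            segments.append((start, end))
        start = separator.end()
    if text[start:].strip():             # last paragraph (no trailing separator)
        segments.append((start, len(text)))
    return segments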

To cover different variants of the problem, this year's task is split into three subproblems. All documents are provided in English.

  • Traditional intrinsic plagiarism detection: here, you can assume that there is one main author who wrote at least 70% of the text. Up to 30% of the text may be written by other authors. For this problem, you should build exactly two clusters: one containing the text fragments of the main author, and the other containing the intrusive fragments.


  • Diarization with a given number (n) of authors: given a document, the task is to build exactly n clusters containing the contributions of the different writers. Each author may have contributed to an arbitrary, but non-zero, extent.

  • Diarization with an unknown number of authors: finally, this variant covers the most challenging task. It is similar to the previous one, but without knowing how many authors contributed to the document.

Training Phase

To develop your algorithms, training data sets for each subproblem, including the corresponding solutions, are provided.


Download corpus


The data set consists of three folders, one for each subtask. For each problem instance X in each subtask, three files are provided:

  • problem-X.txt contains the actual text
  • problem-X.meta contains meta information about the file in JSON format: the "language" (always English this year), the problem "type" ("plagiarism" or "diarization"), and, for the diarization problem with a given number of authors, additionally the correct number of authors ("numAuthors")
  • problem-X.truth contains the ground truth, i.e., the correct solution in JSON format:
    {
        "authors": [
            [
                {"from": fromCharPosition,
                "to": toCharPosition},
                …
            ],
            …
        ]
    }
    

    To identify the text fragments, the absolute character start/end positions within the document are used, whereby the document starts at character position 0.

    Note that, for simplicity, the solutions for the intrinsic plagiarism task contain exactly 2 clusters: one for the main author and one combining all other authors. Nevertheless, when producing your output, you are free to create as many clusters as you wish for the plagiarized sections.

    An example of an intrinsic plagiarism detection solution could look like this:

    {
    	"authors": [
    		[
    			{"from": 314, "to": 15769}
    		],
    		[
    			{"from": 0, "to": 313},
    			{"from": 15770, "to": 19602}
    		]
    	]
    }
    
    An example of the diarization solution of a document that was written by four authors could then look like this:

    {
        "authors": [
            [
                {"from": 123, "to": 400},
                {"from": 598, "to": 680}
            ],
            [
                {"from": 0, "to": 122}
            ],
            [
                {"from": 401, "to": 597},
                {"from": 681, "to": 1020},
                {"from": 1101, "to": 1400}
            ],
            [
                {"from": 1021, "to": 1100}
            ]
        ]
    }
    

    Of course, in the actual evaluation phase, the ground truth, i.e., the problem-X.truth file, will be missing. A minimal sketch for loading such a training instance follows.
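The following loading sketch assumes only the file layout described above (the path prefix in the example call is illustrative). The "to" positions are treated as inclusive, consistent with the consecutive spans in the training examples above; this is an assumption on our part.

import json

def load_problem(base_path):
    """Load one training instance given its path prefix,
    e.g. load_problem("diarization/problem-1") (path layout assumed)."""
    with open(base_path + ".txt", encoding="utf-8") as f:
        text = f.read()
    with open(base_path + ".meta", encoding="utf-8") as f:
        meta = json.load(f)   # "language", "type", optionally "numAuthors"
    with open(base_path + ".truth", encoding="utf-8") as f:
        truth = json.load(f)  # {"authors": [[{"from": ..., "to": ...}, ...], ...]}
    return text, meta, truth

def author_fragments(text, truth):
    """Recover each author's text fragments from the ground truth;
    "to" positions are treated as inclusive (assumption, see above)."""
    return [[text[span["from"]:span["to"] + 1] for span in author]
            for author in truth["authors"]]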

Evaluation Phase
Once you have finished tuning your approach to achieve satisfying performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the evaluation corpus, including the ground truth data, will become available. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
Output

In general, the data structure during the evaluation phase will be similar to that of the training phase, except that the ground truth files are missing. This means that you can still use the information provided in the problem-X.meta file. Your software should output the missing solution file problem-X.truth for every problem instance X in the respective output folder (see Submission). The output syntax should be exactly as used in the training phase.

In general, there is no difference in the output format between the intrinsic plagiarism detection and the diarization subtasks. Moreover, the order of the entries is not relevant.

In the following, we provide you with some examples for both subtasks:

  • For the intrinsic plagiarism detection subtask, you should create one entry for the main author. For the plagiarized sections, you are free to either combine them into one entry (as is done in the training data) or split them into several entries. As an example, if you found 2 plagiarized sections in the file problem-3.txt, you should produce the file problem-3.truth, where both

    {
    	"authors": [
    		[
    			{"from": 314, "to": 15769}
    		],
    		[
    			{"from": 0, "to": 313},
    			{"from": 15770, "to": 19602}
    		]
    	]
    }
    

    and

    {
    	"authors": [
    		[
    			{"from": 314, "to": 15769}
    		],
    		[
    			{"from": 0, "to": 313}
    		],
    		[
    			{"from": 15770, "to": 19602}
    		]
    	]
    }
    

    are valid solutions.

  • For the diarization subtask, if you found 3 authors for the file problem-12.txt, you should produce the file problem-12.truth containing a solution like the following (a minimal serialization sketch is given after this example):

    {
    	"authors": [
    		[
    			{"from": 0, "to": 409},
    			{"from": 645, "to": 4893}
    		],
    		[
    			{"from": 410, "to": 644},
    			{"from": 4894, "to": 6716}
    		],
    		[
    			{"from": 6717, "to": 15036}
    		]
    	]
    }
    
Performance Measures
  • To evaluate the quality of the intrinsic plagiarism detection algorithms, the micro- and macro-averaged F-score will be used (see PAN@CLEF'11)
  • For the diarization algorithms, the BCubed F-score (Amigo et al. 2007) will be used, as in the author clustering task. A simplified sketch of the character-level plagiarism measure follows this list.
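The official measures follow the PAN plagiarism detection evaluation framework; as a rough self-check only, the following simplified sketch computes a micro-averaged character-level F-score by pooling true and detected plagiarized character positions over all documents (macro-averaging would instead average per-document scores). It ignores details of the official measures such as detection granularity.

def char_set(spans):
    """Flatten (from, to) spans into a set of character positions
    ("to" treated as inclusive, as in the ground-truth examples)."""
    return {pos for start, end in spans for pos in range(start, end + 1)}

def plagiarism_micro_fscore(true_spans_per_doc, detected_spans_per_doc):
    """true_spans_per_doc / detected_spans_per_doc: one list of (from, to)
    plagiarism spans per document, in the same document order."""
    tp = fp = fn = 0
    for true_spans, detected_spans in zip(true_spans_per_doc, detected_spans_per_doc):
        truth, detected = char_set(true_spans), char_set(detected_spans)
        tp += len(truth & detected)
        fp += len(detected - truth)
        fn += len(truth - detected)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0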
Submission

We ask you to prepare your software so that it can be executed via command line calls. The command shall take as input (i) an absolute path to the directory of the evaluation corpus and (ii) an absolute path to an empty output directory:

> mySoftware -i EVALUATION-DIRECTORY -o OUTPUT-DIRECTORY

Within EVALUATION-DIRECTORY, you will find a list of folders. Exactly as in the training data, each folder contains [filename].txt and corresponding [filename].meta files. You should use the latter to determine which problem type to solve (plagiarism detection or diarization).

For each such folder, you should create a folder with the same name in OUTPUT-DIRECTORY. Finally, your software should write a [filename].truth file for each [filename] inside these directories. A minimal skeleton combining these steps follows.
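The skeleton below assumes only the layout described above; solve_plagiarism and solve_diarization are trivial placeholders (they assign the whole document to a single author) to be replaced by your own approach.

import argparse
import json
import os

def solve_plagiarism(text):
    """Placeholder: report the whole document as written by the main author."""
    return [[(0, len(text) - 1)]]

def solve_diarization(text, num_authors):
    """Placeholder: assign the whole document to a single author.
    num_authors may be None for the unknown-number-of-authors variant."""
    return [[(0, len(text) - 1)]]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", dest="input_dir", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    args = parser.parse_args()

    for folder in sorted(os.listdir(args.input_dir)):
        in_folder = os.path.join(args.input_dir, folder)
        if not os.path.isdir(in_folder):
            continue
        out_folder = os.path.join(args.output_dir, folder)
        os.makedirs(out_folder, exist_ok=True)

        for name in sorted(os.listdir(in_folder)):
            if not name.endswith(".meta"):
                continue
            base = name[:-len(".meta")]
            with open(os.path.join(in_folder, name), encoding="utf-8") as f:
                meta = json.load(f)
            with open(os.path.join(in_folder, base + ".txt"), encoding="utf-8") as f:
                text = f.read()

            # dispatch on the problem type declared in the .meta file
            if meta["type"] == "plagiarism":
                authors = solve_plagiarism(text)
            else:
                authors = solve_diarization(text, meta.get("numAuthors"))

            solution = {"authors": [[{"from": s, "to": e} for s, e in a] for a in authors]}
            with open(os.path.join(out_folder, base + ".truth"), "w", encoding="utf-8") as f:
                json.dump(solution, f, indent=4)

if __name__ == "__main__":
    main()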

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once your software is deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work

We refer you to:

Michael Tschuggnall

University of Innsbruck

Task Committee

Günther Specht

University of Innsbruck

Martin Potthast

Bauhaus-Universität Weimar

Benno Stein

Bauhaus-Universität Weimar