Authorship attribution is an important problem in many areas, including information retrieval and computational linguistics, as well as in applied areas such as law and journalism, where knowing the author of a document (such as a ransom note) may help save lives. The most common framework for testing candidate algorithms is a text classification problem: given known sample documents from a small, finite set of candidate authors, which of them, if any, wrote a questioned document of unknown authorship? It has been commented, however, that this may be an unreasonably easy task. A more demanding problem is author verification: given a set of documents by a single author and a questioned document, determine whether the questioned document was written by that particular author. This may more accurately reflect the experience of professional forensic linguists, who are often called upon to answer this kind of question. This is the third year that PAN focuses on the author verification problem. The major difference from previous PAN editions is that this year we no longer consider cases where all texts within a verification problem are in the same genre or the same thematic area. We are focusing on cross-genre and cross-topic author verification, a more challenging version of the problem that better resembles real-world applications.
A note to forensic linguists: In order to bridge the gap between linguistics and computer science, we strongly encourage submissions from researchers from both fields. We understand that research groups with expertise in linguistics use manual or semi-automated methods and, therefore, they are not able to submit their software. To enable their participation, we will provide them with the opportunity to analyze the test corpus after the deadline of software submission (mid-April). Their results will be ranked in a separate list with respect to the performance of the software submissions and they will be entitled to describe their approach in a paper. In this framework, any scholar or research group with expertise in linguistics wishing to participate should contact the Task Chair.
Given a small set (no more than 5, possibly as few as one) of "known" documents by a single person and a "questioned" document, the task is to determine whether the questioned document was written by the same person who wrote the known document set. The genre and/or topic may differ significantly between the known and unknown documents.
To develop your software, we provide you with a training corpus that comprises a set of author verification problems in several languages/genres. Each problem consists of some (up to five) known documents by a single person and exactly one questioned document. All documents within a single problem instance will be in the same language. However, their genre and/or topic may differ significantly. The document lengths vary from a few hundred to a few thousand words.
Download corpus (Update April 19, 2015)
The documents of each problem are located in a separate folder, the name of which (problem ID) encodes the language of the documents. The following list shows the available sub-corpora, including their language, type (cross-genre or cross-topic), code, and examples of problem IDs:
|Language|Type|Code|Problem IDs|
|---|---|---|---|
|Dutch|Cross-genre|DU|DU001, DU002, DU003, etc.|
|English|Cross-topic|EN|EN001, EN002, EN003, etc.|
|Greek|Cross-topic|GR|GR001, GR002, GR003, etc.|
|Spanish|Cross-genre|SP|SP001, SP002, SP003, etc.|
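Because the problem ID encodes the language, a submission can select language-specific resources from the folder name alone. The following is a minimal sketch of that lookup, not part of the official tooling: the function name `parse_problem_id` and the assumption that every ID is a two-letter code followed only by digits are illustrative and inferred from the example IDs listed above.

```python
import re

# Language codes as listed for the training sub-corpora.
LANGUAGE_BY_CODE = {"DU": "Dutch", "EN": "English", "GR": "Greek", "SP": "Spanish"}

# Assumption: an ID is exactly two capital letters followed by digits (e.g. EN001).
_PROBLEM_ID = re.compile(r"^([A-Z]{2})(\d+)$")


def parse_problem_id(problem_id):
    """Split a problem ID such as 'EN001' into (language name, problem number)."""
    match = _PROBLEM_ID.match(problem_id)
    if match is None or match.group(1) not in LANGUAGE_BY_CODE:
        raise ValueError(f"unrecognised problem ID: {problem_id!r}")
    return LANGUAGE_BY_CODE[match.group(1)], int(match.group(2))
```

For instance, `parse_problem_id("GR012")` yields the Greek sub-corpus and problem number 12, while an ID with an unknown code raises `ValueError`.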
The ground truth data of the training corpus is provided in a file that contains one line per problem, with the problem ID and the correct binary answer (Y means the known and the questioned documents are by the same author, and N means the opposite):
EN001 N
EN002 Y
EN003 N
...
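The ground truth format can be read with a few lines of code. The sketch below assumes the file content has already been loaded into a string and that ID and answer are separated by whitespace; the function name `parse_ground_truth` is hypothetical.

```python
def parse_ground_truth(text):
    """Map each problem ID to True (same author, 'Y') or False (different author, 'N')."""
    truth = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        problem_id, answer = line.split()
        if answer not in ("Y", "N"):
            raise ValueError(f"unexpected answer {answer!r} for {problem_id}")
        truth[problem_id] = (answer == "Y")
    return truth
```

Called on the three example lines above, this returns a dictionary in which only `EN002` maps to `True`.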
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, your software will be tested on the evaluation corpus. During the competition, the evaluation corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the evaluation corpus will become available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
Output
Your software must take as input the absolute path to a set of problems. For each problem there is a separate sub-folder within that path containing the set of known documents and the single unknown document of that problem (as in the training corpus). The software has to output a single text file, answers.txt, with all the produced answers for the whole set of evaluation problems. Each line of this file corresponds to a problem instance: it starts with the ID of the problem, followed by a score, a real number in [0,1] inclusive, corresponding to the probability of a positive answer. That is, 0 means it is absolutely sure the questioned document is not by the author of the known documents, 1.0 means it is absolutely sure the questioned document and the known documents are by the same author, and 0.5 means that a positive and a negative answer are equally likely. The probability scores should be rounded to three decimal digits. Use a single whitespace character to separate problem ID and probability score.
For example, an answers.txt file may look like this:
EN001 0.031
EN002 0.874
EN003 0.500
...
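Producing this file is mostly a matter of consistent formatting. The sketch below is one possible way to do it, assuming the scores are held in a dictionary keyed by problem ID; the function name `format_answers` and the choice to sort by ID are illustrative, not requirements stated by the task.

```python
def format_answers(scores):
    """Render {problem ID: probability} as answers.txt content, one problem per line."""
    lines = []
    for problem_id in sorted(scores):
        score = scores[problem_id]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {problem_id} is outside [0, 1]: {score}")
        # Three decimal digits, a single space between ID and score.
        lines.append(f"{problem_id} {score:.3f}")
    return "\n".join(lines) + "\n"
```

One consequence of rounding worth keeping in mind: a score such as 0.5004 is written as 0.500, so a system that wants to commit to an answer may prefer to keep its scores clearly away from 0.5.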
The participants' answers will be evaluated according to the area under the ROC curve (AUC) of their probability scores.
In addition, the performance of the binary classification results (automatically extracted from probability scores where every score greater than 0.5 corresponds to a positive answer, every score lower than 0.5 corresponds to a negative answer, while 0.5 corresponds to an unanswered problem, or an "I don't know" answer) will be measured based on c@1 (Peñas & Rodrigo, 2011):
- c@1 = (1/n)*(nc+(nu*nc/n))
- n = #problems
- nc = #correct_answers
- nu = #unanswered_problems
Note: when positive/negative answers are provided for all available problems (probability scores different from 0.5), then c@1=accuracy. However, c@1 rewards approaches that maintain the same number of correct answers and decrease the number of incorrect answers by leaving some problems unanswered (when probability score equals 0.5).
The final ranking of the participants will be based on the product of AUC and c@1.
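For self-evaluation on the training corpus, the measures can be sketched as follows. This is not the official evaluator: the function names are hypothetical, and AUC is computed here through its pairwise interpretation (the fraction of positive/negative problem pairs in which the positive problem receives the higher score, with ties counted as one half), which is an assumption about how ties are handled.

```python
def c_at_1(truth, scores):
    """c@1 = (1/n) * (nc + nu * nc / n); a score of exactly 0.5 counts as unanswered."""
    n = len(truth)
    n_correct = 0
    n_unanswered = 0
    for problem_id, same_author in truth.items():
        score = scores[problem_id]
        if score == 0.5:
            n_unanswered += 1
        elif (score > 0.5) == same_author:
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


def auc(truth, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative problems."""
    positives = [scores[p] for p, same in truth.items() if same]
    negatives = [scores[p] for p, same in truth.items() if not same]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


def final_score(truth, scores):
    """Ranking criterion: the product of AUC and c@1."""
    return auc(truth, scores) * c_at_1(truth, scores)
```

As a worked example, take four problems with ground truth N, Y, N, Y and scores 0.031, 0.874, 0.5, 0.4. Two answers are correct, one is unanswered, and one is wrong, so c@1 = (2 + 1*2/4)/4 = 0.625. Three of the four positive/negative pairs are ordered correctly, so AUC = 0.75, and the final score is 0.46875.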
We ask you to prepare your software so that it can be executed via command line calls. To maximize the sustainability of software submissions for this task, we encourage you to prepare your software so it can be re-trained on demand, i.e., by offering two commands, one for training, and one for testing. This way, your software can be reused on future evaluation corpora as well as on private collections submitted to PAN via our data submission initiative.
The training command shall take as input (i) an absolute path to a training corpus formatted as described above, and (ii) an absolute path to an empty output directory:
> myTrainingSoftware -i path/to/training/corpus -o path/to/output/directory
Based on the training corpus, and possibly on the language and genre of the documents it contains, your software shall train a classification model and save the trained model to the specified output directory in serialized or binary form.
The testing command shall take as input (i) an absolute path to a test corpus (not containing the ground truth), (ii) an absolute path to a previously trained classification model, and (iii) an absolute path to an empty output directory:
> myTestingSoftware -i path/to/test/corpus -m path/to/classification/model -o path/to/output/directory
Based on the classification model, the software shall classify each case found in the test corpus and write an output file as described above to the output directory.
However, offering a command for training is optional, so if you face difficulties in doing so, you may skip the training command and omit the model option -m from the testing command.
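A testing command along these lines could be wired up as in the following sketch. It is only an illustration of the `-i`, `-m`, `-o` interface described above: the function names `build_parser` and `run` are hypothetical, the model path is accepted but not used, and the constant score of 0.5 is a placeholder where a real verifier would compute a probability.

```python
import argparse
import os


def build_parser():
    """Command-line interface of the testing command; -m is optional."""
    parser = argparse.ArgumentParser(description="Author verification (testing)")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="absolute path to the test corpus")
    parser.add_argument("-m", dest="model_path", default=None,
                        help="absolute path to a previously trained model (optional)")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="absolute path to an empty output directory")
    return parser


def run(input_dir, output_dir, model_path=None):
    """Write answers.txt with one line per problem sub-folder found in input_dir."""
    problem_ids = sorted(
        name for name in os.listdir(input_dir)
        if os.path.isdir(os.path.join(input_dir, name))
    )
    with open(os.path.join(output_dir, "answers.txt"), "w", encoding="utf-8") as out:
        for problem_id in problem_ids:
            score = 0.5  # placeholder: replace with the verifier's probability
            out.write(f"{problem_id} {score:.3f}\n")
    return problem_ids
```

A script entry point would then parse the arguments with `build_parser().parse_args()` and pass `input_dir`, `output_dir`, and `model_path` to `run`.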
You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:
Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.
Similarities/Differences with PAN-2014
For your convenience, we summarize here the main similarities/differences of the author identification task @PAN-2015 with respect to the corresponding task @PAN-2014:
- The task definition is essentially the same
- The format of corpus and ground truth is the same
- The format of input/output of your software is the same
- The positive/negative problems are equally distributed
- The evaluation measures are the same
- It is possible (optionally) to submit a trainable version of your approach to be used with any given training corpus
- The genre and/or topic of the documents within a verification problem may differ significantly.
We refer you to:
- PAN @ CLEF'14 (overview paper)
- PAN @ CLEF'13 (overview paper)
- PAN @ CLEF'12 (overview paper)
- PAN @ CLEF'11 (overview paper)
- Patrick Juola. Authorship Attribution. In Foundations and Trends in Information Retrieval, Volume 1, Issue 3, March 2008.
- Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60, Issue 1, pages 9-26, January 2009.
- Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, Volume 60, Issue 3, pages 538-556, March 2009.