Author Identification

PAN at CLEF 2012

This task is divided into authorship attribution and sexual predator identification. You can choose to solve one or both of them.

Authorship Attribution

Task

Within the traditional authorship attribution task there are different flavors:

  • Traditional (closed-class/open-class, with varying numbers of candidate authors) authorship attribution. Within the closed class you will be given a closed set of candidate authors and are asked to identify which one of them is the author of an anonymous text. Within the open class you must also consider the possibility that none of the candidates is the real author of the document (a minimal, purely illustrative baseline sketch follows this list).

  • Authorship clustering/intrinsic plagiarism: in this problem you are given a text (which, for simplicity, is segmented into a sequence of "paragraphs") and are asked to cluster the paragraphs into exactly two clusters: one that includes paragraphs written by the "main" author of the text and another that includes all paragraphs written by anybody else. (Thus, this year intrinsic plagiarism detection has moved from the plagiarism task to the author identification track.)
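
The task description above does not prescribe any method. As a minimal, purely illustrative baseline for the closed-class setting (not part of the task; the training data and author labels below are placeholders, and scikit-learn is an assumption about tooling), a character n-gram profile combined with a simple linear classifier is a common starting point:

# Illustrative closed-class attribution baseline (placeholder data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training documents with known candidate authors.
train_texts = ["...text known to be by author A...", "...text known to be by author B..."]
train_authors = ["A", "B"]

# Character 3-grams capture sub-word stylistic habits (function words,
# punctuation, affixes) that tend to survive topic changes.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(),
)
model.fit(train_texts, train_authors)

# Attribute an anonymous document to one of the candidates.
print(model.predict(["...anonymous text..."])[0])

For the open-class variant, one would additionally threshold the classifier's decision score and answer "None of the Above" when no candidate scores high enough.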

Training Corpus

To develop your software, we provide you with a training corpus that comprises several common attribution and clustering scenarios.

Learn more » Download corpus

Output

As per repeated requests, here is a sample submission format to use for the Traditional Authorship Attribution Competition for PAN/CLEF. Please note that following this format is not mandatory and we will continue to accept anything we can interpret.

For traditional authorship problems (e.g. problem A), use the following (all words in ALL CAPS should be filled out appropriately):

team TEAM NAME : run RUN NUMBER
task TASK IDENTIFIER
file TEST FILE = AUTHOR IDENTIFIER
file TEST FILE = AUTHOR IDENTIFIER
...

For problems E and F, there are no designated sample authors, so we recommend listing paragraph numbers. The author identifier is optional and arbitrary; if you prefer to talk about authors A and B, or authors 1 and 2, you can insert that into the appropriate field. Any paragraphs not listed will be assumed to be part of an unnamed default author.

team TEAM NAME : run RUN NUMBER
task TASK IDENTIFIER
file TEST FILE = AUTHOR IDENTIFIER (PARAGRAPH LIST)
file TEST FILE = AUTHOR IDENTIFIER
...

For example:

team Jacob : run 1
task B
file 12Btest01.txt = A
file 12Btest02.txt = A
file 12Btest03.txt = A
file 12Btest04.txt = None of the Above
file 12Btest05.txt = A
file 12Btest06.txt = A
file 12Btest07.txt = A
file 12Btest08.txt = A
file 12Btest09.txt = A
file 12Btest10.txt = A

task C
file 12Ctest01.txt = A
file 12Ctest02.txt = A
file 12Ctest03.txt = A
file 12Ctest04.txt = A
file 12Ctest05.txt = A
file 12Ctest06.txt = A
file 12Ctest07.txt = A
file 12Ctest08.txt = A
file 12Ctest09.txt = A

task F
file 12Ftest01.txt = (1,2,3,6,7)
file 12Ftest01.txt = (4,5)

In this sample file, we consider anything not listed in task F (paragraphs 8 and beyond) to be a third, unnamed author.
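
If it helps, the sample format above is straightforward to emit programmatically. The following sketch writes a run file in that format; the team name, run number, and per-file predictions are placeholders:

# Sketch: write a run file in the sample submission format above.
predictions = {
    "B": {"12Btest01.txt": "A", "12Btest04.txt": "None of the Above"},
    "F": {"12Ftest01.txt": "(1,2,3,6,7)"},
}

with open("run.txt", "w") as f:
    f.write("team Jacob : run 1\n")
    for task, files in predictions.items():
        f.write(f"task {task}\n")
        for test_file, author in files.items():
            f.write(f"file {test_file} = {author}\n")
        f.write("\n")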

Performance Measures

The performance of your authorship attribution will be judged by average precision, recall, and F1 over all authors in the given training set.
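
As a rough self-check during development, macro-averaged precision, recall, and F1 over authors can be computed, for example, with scikit-learn (an assumption about tooling, not the official evaluation script; the labels below are placeholders):

# Sketch: macro-averaged precision/recall/F1 over authors.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["A", "A", "B", "B", "C"]  # ground-truth authors
y_pred = ["A", "B", "B", "B", "C"]  # predicted authors

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")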

Test Corpus

Once you have finished tuning your approach to achieve satisfying performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus, including ground truth data, will be made available. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via mail, please upload it to a file hoster of your choice and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

Authorship attribution performance
Overall Participant
86.37 Marius Popescu* and Cristian Grozea°
°Fraunhofer FIRST, Germany, and *University of Bucharest, Romania
83.40 Navot Akiva
Bar Ilan University, Israel
82.41 Michael Ryan and John Noecker Jr
Duquesne University, USA
70.81 Ludovic Tanguy, Franck Sajous, Basilio Calderone, and Nabil Hathout
CLLE-ERSS: CNRS and University of Toulouse, France
62.13 Esteban Castillo°, Darnes Vilariño°, David Pinto°, Iván Olmos°, Jesús A. González*, and Maya Carrillo°
°Benemérita Universidad Autónoma de Puebla and *Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico
59.77 François-Marie Giraud and Thierry Artières
LIP6, Université Pierre et Marie Curie (UPMC), France
58.35 Upendra Sapkota and Thamar Solorio
University of Alabama at Birmingham, USA
57.55 Ramon de Graaff° and Cor J. Veenman*
°Leiden University and *Netherlands Forensics Institute, The Netherlands
57.40 Stefan Ruseti and Traian Rebedea
University Politehnica of Bucharest, Romania
54.88 Anna Vartapetiance and Lee Gillam
University of Surrey, UK
43.18 Roman Kern°*, Stefan Klampfl*, Mario Zechner*
°Graz University of Technology and *Know-Center GmbH, Austria
16.63 Julian Brooke and Graeme Hirst
University of Toronto, Canada

A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.

Complete performances (Excel) Learn more »

Related Work

For an overview of approaches to automated authorship attribution, we refer you to the recent survey papers in the area.

Sexual Predator Identification

Task

The goal of this sub-task is to identify a particular class of authors, namely online predators. You will be given chat logs involving two (or more) people and have to determine who is trying to convince the other(s) to provide some sexual favor. You will also need to identify the particular parts of the conversations where the predator's bad behavior becomes evident.

The task can therefore be divided into two parts:

  1. Identify the predators (within all the users)
  2. Identify the parts (the lines) of the predator conversations which are the most distinctive of the predator's bad behavior

Given the public nature of the dataset, we ask participants not to use external or online resources (e.g., search engines) to solve this task, but to extract evidence from the provided datasets only.

Training Corpus

To develop your software, we provide you with a training corpus consisting of chat logs in which minors and adults pretending to be minors are chatting.

Learn more » Download corpus

Output

For each of the two parts we require a different format.

Identify the predators (within all the users).

Participants should submit a text file containing one user-id per line, listing only those users identified as predators:

...
a7c5056a2c30e2dc637907f448934ca3
58f15bbb100bbeb6963b4b967ce04bdf
e040eb115e3f7ad3824e93141665fc2a
3d57ed3fac066fa4f8a52432db51c019
...
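
A sketch producing this file (the IDs, variable names, and output file name are placeholders):

# Sketch: write one predicted predator user-id per line.
predicted_predators = [
    "a7c5056a2c30e2dc637907f448934ca3",
    "58f15bbb100bbeb6963b4b967ce04bdf",
]
with open("predators.txt", "w") as f:
    f.write("\n".join(predicted_predators) + "\n")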

Identify the parts (the lines) of the predator conversations which are the most distinctive of the predator's bad behavior.

Participants should submit an XML file similar to those in the corpus, containing the conversation IDs and the line numbers of the messages considered suspicious (each line number together with all the other message information: author, time, and text):

<conversations>
  ...
  <conversation id="0042762e26ed295a8576806f5548cad9">
    <message line="3">
      <author>f069dbec9ab3e090972d432db279e3eb</author>
      <time>03:20</time>
      <text>whats up?</text>
    </message>
    <message line="4">
      <author>f069dbec9ab3e090972d432db279e3eb</author>
      <time>03:21</time>
      <text>how u doing?</text>
    </message>
    ...
    <message line="10">
      <author>f069dbec9ab3e090972d432db279e3eb</author>
      <time>04:00</time>
      <text>sse you llater?</text>
    </message>
  </conversation>
  ...
  <conversation id="0209b0a30c8eced86863631ada73a530">
    <message line="3">
      <author>0042762e26ed295a8576806f5548cad9</author>
      <time>01:17</time>
      <text>and that i dont touch u</text>
    </message>
  </conversation>
  ...
</conversations>
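
Such a file can be generated, for example, with Python's standard xml.etree.ElementTree module. In the sketch below, the flagged dictionary (conversation ID mapped to suspicious messages) and the output file name are placeholders:

# Sketch: emit suspicious predator lines in the XML format above.
import xml.etree.ElementTree as ET

# Placeholder: conversation id -> list of (line, author, time, text).
flagged = {
    "0042762e26ed295a8576806f5548cad9": [
        ("3", "f069dbec9ab3e090972d432db279e3eb", "03:20", "whats up?"),
    ],
}

root = ET.Element("conversations")
for conv_id, messages in flagged.items():
    conv = ET.SubElement(root, "conversation", id=conv_id)
    for line, author, time, text in messages:
        msg = ET.SubElement(conv, "message", line=line)
        ET.SubElement(msg, "author").text = author
        ET.SubElement(msg, "time").text = time
        ET.SubElement(msg, "text").text = text

ET.ElementTree(root).write("suspicious.xml", encoding="utf-8")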

Performance Measures

The performance of your predator identification will be judged by average precision, recall, and F-measure over all persons involved and over the lines of the conversations. Participants will be ranked by F0.5 for predator identification and by F3 for predator line identification, respectively.
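
For reference, the general Fβ measure weights recall β times as much as precision, so F0.5 rewards precise predator identification (few false accusations) while F3 rewards high recall when flagging suspicious lines. A minimal sketch of the formula (the example values are placeholders):

# Sketch: the general F-beta measure.
# beta < 1 weights precision more heavily; beta > 1 weights recall.
def f_beta(precision: float, recall: float, beta: float) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.9, 0.6, 0.5))  # precision-weighted, as for part 1
print(f_beta(0.9, 0.6, 3.0))  # recall-weighted, as for part 2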

Test Corpus

Once you have finished tuning your approach to achieve satisfying performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus, including ground truth data, will be made available. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via mail, please upload it to a file hoster of your choice and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

Sexual predator identification performance: 1) identify predators
F0.5 Participant
0.9346 Esaú Villatoro-Tello°*, Antonio Juárez-González°, Hugo J. Escalante°, Manuel Montes-y-Gómez°, and Luis Villaseñor-Pineda°
°Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) and *Universidad Autónoma Metropolitana, Mexico
0.9168 Tim Snider
Porfiau Inc., Canada
0.8691 Javier Parapar°, David E. Losada*, and Alvaro Barreiro°
°University of A Coruña and *Universidade de Santiago de Compostela, Spain
0.8652 Colin Morris and Graeme Hirst
University of Toronto, Canada
0.8638 Gunnar Eriksson and Jussi Karlgren
Gavagai AB, Sweden
0.8137 Claudia Peersman, Frederik Vaassen, Vincent Van Asch, and Walter Daelemans
University of Antwerp, Belgium
0.7316 Cristian Grozea° and Marius Popescu*
°Fraunhofer FIRST, Germany, and *University of Bucharest, Romania
0.7060 Rachel Sitarz
Purdue University, USA
0.5537 Anna Vartapetiance and Lee Gillam
University of Surrey, UK
0.3946 April Kontostathis°, Andy Garron*, Kelly Reynolds^, Will West^, and Lynne Edwards°
°Ursinus College, *The University of Maryland, and ^Lehigh University, USA
0.2554 In-Su Kang°, Chul-Kyu Kim°, Shin-Jae Kang*, and Seung-Hoon Na^
°Kyungsung University, *Daegu University, and ^Electronics and Telecommunications Research Institute, South Korea
0.1791 Roman Kern°*, Stefan Klampfl*, Mario Zechner*
°Graz University of Technology and *Know-Center GmbH, Austria
0.0316 Dasha Bogdanova° and Paolo Rosso*
°University of Saint Petersburg, Russia, and *Universitat Politècnica de València, Spain
0.0250 Sriram Prasath Elango
KTH/Gavagai, Sweden
0.0232 Darnes Vilariño, Esteban Castillo, David Pinto, Iván Olmos, Saul León
Benemérita Universidad Autónoma de Puebla, Mexico
0.0059 José María Gómez Hidalgo° and Andrés Alfonso Caurcel Díaz*
°Optenet and *Universidad Politécnica de Madrid, Spain
Sexual predator identification performance: 2) identify predator lines
F3 Participant
0.4762 Cristian Grozea° and Marius Popescu*
°Fraunhofer FIRST, Germany, and *University of Bucharest, Romania
0.4174 April Kontostathis°, Andy Garron*, Kelly Reynolds^, Will West^, and Lynne Edwards°
°Ursinus College, *The University of Maryland, and ^Lehigh University, USA
0.2679 Claudia Peersman, Frederik Vaassen, Vincent Van Asch, and Walter Daelemans
University of Antwerp, Belgium
0.2364 Rachel Sitarz
Purdue University, USA
0.1986 Colin Morris and Graeme Hirst
University of Toronto, Canada
0.1838 Roman Kern°*, Stefan Klampfl*, Mario Zechner*
°Graz University of Technology and *Know-Center GmbH, Austria
0.1633 Gunnar Eriksson and Jussi Karlgren
Gavagai AB, Sweden
0.0770 Sriram Prasath Elango
KTH/Gavagai, Sweden
0.0174 Javier Parapar°, David E. Losada*, and Alvaro Barreiro°
°University of A Coruña and *Universidade de Santiago de Compostela, Spain
0.0154 Anna Vartapetiance and Lee Gillam
University of Surrey, UK
0.0074 Darnes Vilariño, Esteban Castillo, David Pinto, Iván Olmos, Saul León
Benemérita Universidad Autónoma de Puebla, Mexico
0.0007 Dasha Bogdanova° and Paolo Rosso*
°University of Saint Petersburg, Russia, and *Universitat Politècnica de València, Spain
0.0002 Esaú Villatoro-Tello°*, Antonio Juárez-González°, Hugo J. Escalante°, Manuel Montes-y-Gómez°, and Luis Villaseñor-Pineda°
°Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) and *Universidad Autónoma Metropolitana, Mexico
0.0000 José María Gómez Hidalgo° and Andrés Alfonso Caurcel Díaz*
°Optenet and *Universidad Politécnica de Madrid, Spain

A more detailed analysis of the detection performances with respect to precision, recall, and granularity can be found in the overview paper accompanying this task.

Learn more »