Author Identification
2012

This task is divided into authorship attribution and sexual predator identification. You can choose to solve one or both of them.

Authorship Attribution

Task

Within the traditional authorship tasks there are different flavors:

  • Traditional (closed-class/open-class, with varying numbers of candidate authors) authorship attribution. In the closed-class setting you are given a closed set of candidate authors and asked to identify which of them is the author of an anonymous text. In the open-class setting you must also consider the possibility that none of the candidates is the real author of the document.

  • Authorship clustering/intrinsic plagiarism: in this problem you are given a text (which, for simplicity, is segmented into a sequence of "paragraphs") and are asked to cluster the paragraphs into exactly two clusters: one that includes the paragraphs written by the "main" author of the text and another that includes all paragraphs written by anybody else. (Thus, this year intrinsic plagiarism has moved from the plagiarism task to the author identification track.)
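A minimal sketch of this two-cluster setting, assuming character trigram profiles and cosine similarity as features (a common stylometric baseline, not the prescribed method; the 0.5 cutoff factor below is an arbitrary illustrative choice):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram profile of a text (a common stylometric feature)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two Counter profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def two_cluster(paragraphs, cutoff_factor=0.5):
    """Assign each paragraph to the 'main' cluster or the 'other' cluster
    by comparing it to the profile of the whole document. Paragraphs whose
    similarity falls well below the median are flagged as 'other'."""
    whole = char_ngrams(" ".join(paragraphs))
    sims = [cosine(char_ngrams(p), whole) for p in paragraphs]
    cutoff = sorted(sims)[len(sims) // 2] * cutoff_factor
    return ["main" if s >= cutoff else "other" for s in sims]
```

A real system would use richer features (function words, POS n-grams) and a proper clustering criterion; this only illustrates the expected input/output shape of the sub-task.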

Training Corpus

To develop your software, we provide you with a training corpus that comprises several different common attribution and clustering scenarios.

Learn more » Download corpus

Output

As per repeated requests, here is a sample submission format to use for the Traditional Authorship Attribution Competition for PAN/CLEF. Please note that following this format is not mandatory and we will continue to accept anything we can interpret.

For traditional authorship problems (e.g. problem A), use the following (all words in ALL CAPS should be filled out appropriately):

team TEAM NAME : run RUN NUMBER
task TASK IDENTIFIER
file TEST FILE = AUTHOR IDENTIFIER
file TEST FILE = AUTHOR IDENTIFIER
...

For problems E and F, there are no designated sample authors, so we recommend listing paragraph numbers. The author identifier is optional and arbitrary -- if it makes you feel better to talk about authors A and B, or authors 1 and 2, you can insert it into the appropriate field. Any paragraphs not listed will be assumed to belong to an unnamed default author.

team TEAM NAME : run RUN NUMBER
task TASK IDENTIFIER
file TEST FILE = AUTHOR IDENTIFIER (PARAGRAPH LIST)
file TEST FILE = AUTHOR IDENTIFIER
...

For example:

team Jacob : run 1
task B
file 12Btest01.txt = A
file 12Btest02.txt = A
file 12Btest03.txt = A
file 12Btest04.txt = None of the Above
file 12Btest05.txt = A
file 12Btest06.txt = A
file 12Btest07.txt = A
file 12Btest08.txt = A
file 12Btest09.txt = A
file 12Btest10.txt = A

task C
file 12Ctest01.txt = A
file 12Ctest02.txt = A
file 12Ctest03.txt = A
file 12Ctest04.txt = A
file 12Ctest05.txt = A
file 12Ctest06.txt = A
file 12Ctest07.txt = A
file 12Ctest08.txt = A
file 12Ctest09.txt = A

task F
file 12Ftest01.txt = (1,2,3,6,7)
file 12Ftest01.txt = (4,5)

In this sample file, we consider anything not listed in task F (paragraphs 8 and beyond) to be a third, unnamed author.
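A run file in this format can be produced mechanically. The sketch below writes one task block per call; the output file name and the result mapping are hypothetical placeholders, while the team name, run number, and test file names follow the sample above:

```python
# Hypothetical attribution results for illustration only.
attributions = {
    "12Btest01.txt": "A",
    "12Btest02.txt": "None of the Above",
}

def write_run(path, team, run, task, results):
    """Write one task block in the submission format shown above:
    a 'team ... : run ...' header, a 'task ...' line, then one
    'file TEST FILE = AUTHOR IDENTIFIER' line per test document."""
    with open(path, "w") as f:
        f.write(f"team {team} : run {run}\n")
        f.write(f"task {task}\n")
        for test_file, author in sorted(results.items()):
            f.write(f"file {test_file} = {author}\n")

write_run("run1.txt", "Jacob", 1, "B", attributions)
```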

Performance Measures

The performance of your authorship attribution will be judged by precision, recall, and F1, averaged over all authors in the given training set.
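A sketch of such per-author (macro) averaging follows; the official evaluation script may differ in details, e.g. how undefined precision for a never-predicted author is handled (here it counts as zero):

```python
from collections import Counter

def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over all authors.
    `gold` and `pred` are parallel lists of author labels, one per
    test document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    authors = set(gold) | set(pred)
    precs, recs, f1s = [], [], []
    for a in authors:
        prec = tp[a] / (tp[a] + fp[a]) if tp[a] + fp[a] else 0.0
        rec = tp[a] / (tp[a] + fn[a]) if tp[a] + fn[a] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(authors)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```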

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus is made available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via mail, please upload it to a file hoster of your choosing and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

Authorship attribution performance
Overall  Participant
86.37    Marius Popescu* and Cristian Grozea°
         °Fraunhofer FIRST, Germany, and *University of Bucharest, Romania
83.40    Navot Akiva
         Bar Ilan University, Israel
82.41    Michael Ryan and John Noecker Jr
         Duquesne University, USA
70.81    Ludovic Tanguy, Franck Sajous, Basilio Calderone, and Nabil Hathout
         CLLE-ERSS: CNRS and University of Toulouse, France
62.13    Esteban Castillo°, Darnes Vilariño°, David Pinto°, Iván Olmos°, Jesús A. González*, and Maya Carrillo°
         °Benemérita Universidad Autónoma de Puebla and *Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico
59.77    François-Marie Giraud and Thierry Artières
         LIP6, Université Pierre et Marie Curie (UPMC), France
58.35    Upendra Sapkota and Thamar Solorio
         University of Alabama at Birmingham, USA
57.55    Ramon de Graaff° and Cor J. Veenman*
         °Leiden University and *Netherlands Forensics Institute, The Netherlands
57.40    Stefan Ruseti and Traian Rebedea
         University Politehnica of Bucharest, Romania
54.88    Anna Vartapetiance and Lee Gillam
         University of Surrey, UK
43.18    Roman Kern°*, Stefan Klampfl*, Mario Zechner*
         °Graz University of Technology and *Know-Center GmbH, Austria
16.63    Julian Brooke and Graeme Hirst
         University of Toronto, Canada

A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.

Complete performances (Excel) Learn more »

Related Work

For an overview of approaches to automated authorship attribution, we refer you to recent survey papers in the area.

Sexual Predator Identification

Task

The goal of this sub-task is to identify a particular class of authors, namely online predators. You will be given chat logs involving two (or more) people and have to determine which one is trying to convince the other(s) to provide some sexual favour. You will also need to identify the particular conversations in which the person exhibits this predatory behavior.

The task can therefore be divided into two parts:

  1. Identify the predators (within all the users)
  2. Identify the parts (the lines) of the predator conversations that are most distinctive of the predator's bad behavior

Given the public nature of the dataset, we ask the participants not to use external or online resources to solve this task (e.g. search engines), but to extract evidence from the provided datasets only.

Training Corpus

To develop your software, we provide you with a training corpus consisting of chat logs in which minors, and adults pretending to be minors, are chatting.

Learn more » Download corpus

Output

For each of the two parts we require a different format.

  1. Identify the predators (within all the users).

    Participants should submit a text file containing one user ID per line, listing only those users identified as predators:

    ...
    a7c5056a2c30e2dc637907f448934ca3
    58f15bbb100bbeb6963b4b967ce04bdf
    e040eb115e3f7ad3824e93141665fc2a
    3d57ed3fac066fa4f8a52432db51c019
    ...
    

  2. Identify the parts (the lines) of the predator conversations that are most distinctive of the predator's bad behavior.

    Participants should submit an XML file, similar to those in the corpus, containing the conversation IDs and the message line numbers considered suspicious (line numbers together with all the other message information: author, time, text):

    <conversations>
      ...
      <conversation id="0042762e26ed295a8576806f5548cad9">
        <message line="3">
          <author>f069dbec9ab3e090972d432db279e3eb</author>
          <time>03:20</time>
          <text>whats up?</text>
        </message>
        <message line="4">
          <author>f069dbec9ab3e090972d432db279e3eb</author>
          <time>03:21</time>
          <text>how u doing?</text>
        </message>
        ...
        <message line="10">
          <author>f069dbec9ab3e090972d432db279e3eb</author>
          <time>04:00</time>
          <text>sse you llater?</text>
        </message>
      </conversation>
      ...
      <conversation id="0209b0a30c8eced86863631ada73a530">
        <message line="3">
          <author>0042762e26ed295a8576806f5548cad9</author>
          <time>01:17</time>
          <text>and that i dont touch u</text>
        </message>
      </conversation>
      ...
    </conversations>
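One way to produce this XML with the Python standard library is sketched below; the conversation ID, author hash, and message values are copied from the sample above, while the output file name and the `suspicious` data structure are hypothetical choices:

```python
import xml.etree.ElementTree as ET

# Hypothetical collection of suspicious lines, keyed by conversation ID;
# each entry is (line number, author ID, time, text).
suspicious = {
    "0042762e26ed295a8576806f5548cad9": [
        (3, "f069dbec9ab3e090972d432db279e3eb", "03:20", "whats up?"),
    ],
}

root = ET.Element("conversations")
for conv_id, messages in suspicious.items():
    conv = ET.SubElement(root, "conversation", id=conv_id)
    for line, author, time, text in messages:
        msg = ET.SubElement(conv, "message", line=str(line))
        ET.SubElement(msg, "author").text = author
        ET.SubElement(msg, "time").text = time
        ET.SubElement(msg, "text").text = text

# Serialize the tree in the required structure.
ET.ElementTree(root).write("predator_lines.xml")
```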
    

Performance Measures

The performance of your predator identification will be judged by precision, recall, and F-measure, averaged over all persons involved and over the lines of the conversations. Participants are ranked by F0.5 for predator identification and by F3 for predator line identification.
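Both rankings use the standard F-beta measure, in which recall is weighted beta times as heavily as precision, so F0.5 favors precision and F3 favors recall. A sketch:

```python
def f_beta(precision, recall, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta < 1 weights precision more (F0.5); beta > 1 weights
    recall more (F3)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```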

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus is made available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via mail, please upload it to a file hoster of your choosing and share a download link with us.

Results

The following tables list the performances achieved by the participating teams:

Sexual predator identification performance: 1) identify predators
F0.5    Participant
0.9346  Esaú Villatoro-Tello°*, Antonio Juárez-González°, Hugo J. Escalante°, Manuel Montes-y-Gómez°, and Luis Villaseñor-Pineda°
        °Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) and *Universidad Autónoma Metropolitana, Mexico
0.9168  Tim Snider
        Porfiau Inc., Canada
0.8691  Javier Parapar°, David E. Losada*, and Alvaro Barreiro°
        °University of A Coruña and *Universidade de Santiago de Compostela, Spain
0.8652  Colin Morris and Graeme Hirst
        University of Toronto, Canada
0.8638  Gunnar Eriksson and Jussi Karlgren
        Gavagai AB, Sweden
0.8137  Claudia Peersman, Frederik Vaassen, Vincent Van Asch, and Walter Daelemans
        University of Antwerp, Belgium
0.7316  Cristian Grozea° and Marius Popescu*
        °Fraunhofer FIRST, Germany, and *University of Bucharest, Romania
0.7060  Rachel Sitarz
        Purdue University, USA
0.5537  Anna Vartapetiance and Lee Gillam
        University of Surrey, UK
0.3946  April Kontostathis°, Andy Garron*, Kelly Reynolds^, Will West^, and Lynne Edwards°
        °Ursinus College, *The University of Maryland, and ^Lehigh University, USA
0.2554  In-Su Kang°, Chul-Kyu Kim°, Shin-Jae Kang*, and Seung-Hoon Na^
        °Kyungsung University, *Daegu University, and ^Electronics and Telecommunications Research Institute, South Korea
0.1791  Roman Kern°*, Stefan Klampfl*, Mario Zechner*
        °Graz University of Technology and *Know-Center GmbH, Austria
0.0316  Dasha Bogdanova° and Paolo Rosso*
        °University of Saint Petersburg, Russia, and *Universitat Politècnica de València, Spain
0.0250  Sriram Prasath Elango
        KTH/Gavagai, Sweden
0.0232  Darnes Vilariño, Esteban Castillo, David Pinto, Iván Olmos, Saul León
        Benemérita Universidad Autónoma de Puebla, Mexico
0.0059  José María Gómez Hidalgo° and Andrés Alfonso Caurcel Díaz*
        °Optenet and *Universidad Politécnica de Madrid, Spain
Sexual predator identification performance: 2) identify predator lines
F3      Participant
0.4762  Cristian Grozea° and Marius Popescu*
        °Fraunhofer FIRST, Germany, and *University of Bucharest, Romania
0.4174  April Kontostathis°, Andy Garron*, Kelly Reynolds^, Will West^, and Lynne Edwards°
        °Ursinus College, *The University of Maryland, and ^Lehigh University, USA
0.2679  Claudia Peersman, Frederik Vaassen, Vincent Van Asch, and Walter Daelemans
        University of Antwerp, Belgium
0.2364  Rachel Sitarz
        Purdue University, USA
0.1986  Colin Morris and Graeme Hirst
        University of Toronto, Canada
0.1838  Roman Kern°*, Stefan Klampfl*, Mario Zechner*
        °Graz University of Technology and *Know-Center GmbH, Austria
0.1633  Gunnar Eriksson and Jussi Karlgren
        Gavagai AB, Sweden
0.0770  Sriram Prasath Elango
        KTH/Gavagai, Sweden
0.0174  Javier Parapar°, David E. Losada*, and Alvaro Barreiro°
        °University of A Coruña and *Universidade de Santiago de Compostela, Spain
0.0154  Anna Vartapetiance and Lee Gillam
        University of Surrey, UK
0.0074  Darnes Vilariño, Esteban Castillo, David Pinto, Iván Olmos, Saul León
        Benemérita Universidad Autónoma de Puebla, Mexico
0.0007  Dasha Bogdanova° and Paolo Rosso*
        °University of Saint Petersburg, Russia, and *Universitat Politècnica de València, Spain
0.0002  Esaú Villatoro-Tello°*, Antonio Juárez-González°, Hugo J. Escalante°, Manuel Montes-y-Gómez°, and Luis Villaseñor-Pineda°
        °Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) and *Universidad Autónoma Metropolitana, Mexico
0.0000  José María Gómez Hidalgo° and Andrés Alfonso Caurcel Díaz*
        °Optenet and *Universidad Politécnica de Madrid, Spain

A more detailed analysis of the detection performances with respect to precision, recall, and granularity can be found in the overview paper accompanying this task.

Learn more »

Task Chairs

Patrick Juola
Duquesne University

Giacomo Inches
University of Lugano

Task Committee

Efstathios Stamatatos
University of the Aegean

Shlomo Argamon
Illinois Institute of Technology

Moshe Koppel
Bar-Ilan University

Fabio Crestani
University of Lugano