Author Profiling 2013
Synopsis
- Task: Given a document, your task is to determine its author's age and gender.
- Input: [data]
- Submission: [submit]
Introduction
Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by people. This helps in identifying profile aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in applications in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.
Task
Note. At RepLab 2013, author profiling will also be approached from the perspective of online reputation monitoring. Given a large number of Twitter profiles with 600 associated tweets each, participants will be asked to classify the author of a set of tweets as journalist, politician, activist, professional, client, company, authority, or citizen, since belonging to a certain category could determine the importance of the user's opinions. The dataset will contain English and Spanish tweets related to the banking and automotive domains.
Award
We are happy to announce the following overall winner of the 1st International Competition on Author Profiling who will be awarded 300,- Euro sponsored by the Forensic Lab of the Universitat Pompeu Fabra Barcelona.
- Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, Hugo Jair Escalante, and Adrián Pastor López-Monroy from INAOE, Mexico.
Congratulations!
Input
To develop your software, we provide you with a training data set that consists of documents written in both English and Spanish. With regard to age, we consider posts of three classes: 10s (13-17), 20s (23-27), and 30s (33-47). Moreover, documents from authors who pretend to be minors are included (e.g., documents composed of chat lines of sexual predators).
Output
Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:
<author id="{author-id}" lang="en|es" age_group="10s|20s|30s" gender="male|female" />
The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension. The output files have to be written either directly to the working directory or to a subfolder. The author-id has to be extracted from each document's filename, which follows the pattern
<authorid>_<lang>_<age>_<gender>.xml
Note that in the test corpus the age and gender information is replaced by "xxx".
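As a rough illustration of the output requirements, the following Python sketch reads the author-id and language from each filename and writes one XML file per document. The names `parse_filename`, `predict`, `write_prediction`, and `run` are hypothetical, and `predict` is only a placeholder that returns a constant label; it is not part of the task material and would be replaced by an actual classifier.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Filename pattern from the task description: <authorid>_<lang>_<age>_<gender>.xml
FILENAME_RE = re.compile(
    r"^(?P<author_id>.+)_(?P<lang>en|es)_(?P<age>[^_]+)_(?P<gender>[^_]+)\.xml$"
)


def parse_filename(name):
    """Split a document filename into author_id, lang, age, and gender."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    return match.groupdict()


def predict(document_path):
    """Hypothetical placeholder classifier: always returns the same labels."""
    return "20s", "female"


def write_prediction(document_path, output_dir):
    """Write one <author ... /> file for a single input document."""
    fields = parse_filename(document_path.name)
    age_group, gender = predict(document_path)
    element = ET.Element("author", {
        "id": fields["author_id"],
        "lang": fields["lang"],
        "age_group": age_group,
        "gender": gender,
    })
    # Recommended naming: author-id as filename, "xml" as extension.
    output_path = output_dir / f"{fields['author_id']}.xml"
    output_path.write_text(ET.tostring(element, encoding="unicode"), encoding="utf-8")
    return output_path


def run(dataset_dir, output_dir):
    """Process every .xml document found under dataset_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    documents = sorted(Path(dataset_dir).rglob("*.xml"))
    return [write_prediction(path, output_dir) for path in documents]
```

Calling `run("/path/to/dataset", "output")` would write the files to a subfolder named `output`; both paths are illustrative.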
Evaluation
The performance of your author profiling solution will be ranked by accuracy.
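The text above does not spell out how accuracy is computed. The sketch below makes one possible reading concrete, under the assumption that a prediction counts as correct only when age group and gender are both right (joint accuracy), and it also reports each attribute separately. The function names and the toy data are made up for illustration.

```python
def accuracy(gold, predicted):
    """Return the fraction of gold authors whose predicted label matches exactly."""
    if not gold:
        raise ValueError("gold must contain at least one author")
    correct = sum(
        1 for author_id, label in gold.items()
        if predicted.get(author_id) == label
    )
    return correct / len(gold)


def project(labels, index):
    """Keep one attribute per author: index 0 is age_group, index 1 is gender."""
    return {author_id: label[index] for author_id, label in labels.items()}


# Toy data (made up for illustration): author-id -> (age_group, gender)
gold = {
    "a1": ("20s", "female"),
    "a2": ("30s", "male"),
    "a3": ("10s", "female"),
    "a4": ("20s", "male"),
}
predicted = {
    "a1": ("20s", "female"),  # both correct
    "a2": ("30s", "female"),  # age correct, gender wrong
    "a3": ("20s", "male"),    # both wrong
    # "a4" has no prediction, so it counts as an error (an assumption)
}

joint = accuracy(gold, predicted)                            # 0.25
age = accuracy(project(gold, 0), project(predicted, 0))      # 0.5
gender = accuracy(project(gold, 1), project(predicted, 1))   # 0.25
```

Under this reading, uniform random guessing over three age groups and two genders would be expected to score about 1/6.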
Results
The following table lists the performances achieved by the participating teams:
English author profiling performance

Accuracy | Team | Affiliation |
---|---|---|
0.3894 | Michał Meina, Karolina Brodzińska, Bartosz Celmer, Maja Czoków, Martyna Patera, Jakub Pezacki, and Mateusz Wilk | Nicolaus Copernicus University, Poland |
0.3813 | A. Pastor López-Monroy°, Manuel Montes-y-Gómez°, Hugo Jair Escalante°, Luis Villaseñor-Pineda°, and Esaú Villatoro-Tello* | °Instituto Nacional de Astrofísica, Óptica y Electrónica and *Universidad Autónoma Metropolitana-Cuajimalpa, Mexico |
0.3677 | Seifeddine Mechti, Maher Jaoua, and Lamia Hadrich Belguith | University of Sfax, Tunisia |
0.3508 | K Santosh, Romil Bansal, Mihir Shekhar, and Vasudeva Varma | International Institute of Information Technology, India |
0.3488 | Wee-Yong Lim, Jonathan Goh, and Vrizlynn L. L. Thing | Institute for Infocomm Research, Singapore |
0.3420 | Susana Ladra°, Francisco Claude*, and Roberto Konow^ | °University of A Coruña, Spain, *University of Waterloo, Canada, and ^University of Chile, Chile |
0.3292 | Yuridiana Aleman, Nahun Loya, Darnes Vilariño, and David Pinto | Benemérita Universidad Autónoma de Puebla, Mexico |
0.3268 | Lee Gillam | University of Surrey, UK |
0.3115 | Roman Kern | Know-Center GmbH, Austria |
0.3113 | Fermín L. Cruz°, Rafa Haro R.*, and F. Javier Ortega° | University of Seville and Zaizi, Spain |
0.2843 | Aditya Pavan, Aditya Mogadala, and Vasudeva Varma | International Institute of Information Technology, India |
0.2840 | Andrés Alfonso Caurcel Díaz° and José María Gómez Hidalgo* | Universidad Politécnica de Madrid and Optenet, Spain |
0.2816 | Delia-Irazú Hernández°, Rafael Guzmán-Cabrera*, Antonio Reyes^, and Martha-Alicia Rocha°' | °Universidad Politécnica de Valencia, Spain, and *Universidad de Guanajuato, ^Instituto Superior de Intérpretes y Traductores, and 'Instituto Tecnológico de León, Mexico |
0.2813 | Magdalena Jankowska, Vlado Kešelj, and Evangelos Milios | Dalhousie University, Canada |
0.2785 | Lucie Flekova and Iryna Gurevych | Technische Universität Darmstadt and German Institute for Educational Research and Educational Information, Germany |
0.2564 | Edson R. D. Weren, Viviane P. Moreira, and José P. M. de Oliveira | UFRGS, Brazil |
0.2471 | Upendra Sapkota°, Thamar Solorio°, Manuel Montes-y-Gómez*, and Gabriela Ramírez-de-la-Rosa° | °University of Alabama at Birmingham, USA, and *Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico |
0.2450 | Maria De-Arteaga, Sergio Jimenez, George Dueñas, Sergio Mancera, and Julia Baquero | Universidad Nacional de Colombia, Colombia |
0.2395 | Erwan Moreau and Carl Vogel | Trinity College Dublin, Ireland |
0.1650 | Baseline | |
0.1574 | Braja Gopal Patra°, Somnath Banerjee°, Dipankar Das*, Tanik Saikh°, and Sivaji Bandyopadhyay° | °Jadavpur University and NIT Meghalaya, India |
0.0741 | Leticia Cagnina, Darío Funez, and Marcelo Errecalde | Universidad Nacional de San Luis, Argentina |
A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.
Related Work
- The Blog Authorship Corpus
- S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52 (2): 119–123.
- J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.
- M. Koppel, S. Argamon and A. Shimoni (2002), Automatically categorizing written texts by author gender, Literary and Linguistic Computing 17 (4): 401–412.
- J. Pennebaker (2011), The Secret Life of Pronouns: What Our Words Say About Us, New York: Bloomsbury Publishing.