Author Profiling 2015
Synopsis
- Task: Given a document, what are its author's traits?
- Input: [data]
- Twitter Downloader: [code]
- Submission: [submit]
Introduction
Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Whereas author identification and author verification examine the style of individual authors, author profiling distinguishes between classes of authors by studying the sociolect aspect of language, that is, how language is shared by groups of people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus here is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.
Task
This task is about predicting an author's demographics from her writing. Participants will be provided with tweets in English and Spanish and asked to predict age, gender, and personality traits. In addition, they will be provided with tweets in Italian and Dutch and asked to predict gender and personality traits only.
Award
We are happy to announce that the best performing team at the 3rd International Competition on Author Profiling will be awarded 300,- Euro sponsored by MeaningCloud.
- Miguel Ángel Álvarez Carmona, Adrián Pastor López Monroy, Manuel Montes y Gómez and Luis Villaseñor Pineda from INAOE, Mexico
Congratulations!
Input
To develop your software, we provide you with a training data set that consists of tweets in English, Spanish, Italian, and Dutch. With regard to age, we consider the following classes: 18-24, 25-34, 35-49, 50-xx. With regard to personality, we provide a score between -0.5 and +0.5 for each trait.
Output
Your software must take as input the absolute path to an unpacked dataset and must output, for each document of the dataset, a corresponding XML file that looks like this:
<author id="{author-id}" type="twitter" lang="en|es|it|nl" age_group="18-24|25-34|35-49|50-xx" gender="male|female" extroverted="-0.5 to +0.5" stable="-0.5 to +0.5" agreeable="-0.5 to +0.5" conscientious="-0.5 to +0.5" open="-0.5 to +0.5" />
The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.
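As an illustration of the output format above, the following is a minimal Python sketch that writes one prediction file per author. The function name `write_prediction`, its parameters, and the choice to omit `age_group` when no age is predicted (as for Italian and Dutch) are assumptions made for this example, not part of the task specification; the clamping of scores to the range -0.5 to +0.5 is likewise only a convenience.

```python
import os
import xml.etree.ElementTree as ET

# Trait attribute names as they appear in the output template.
TRAITS = ("extroverted", "stable", "agreeable", "conscientious", "open")


def write_prediction(out_dir, author_id, lang, gender, traits, age_group=None):
    """Write a single <author .../> element to <out_dir>/<author_id>.xml.

    Hypothetical helper: `traits` maps each name in TRAITS to a float.
    Returns the path of the file that was written.
    """
    attrs = {"id": author_id, "type": "twitter", "lang": lang}
    if age_group is not None:
        # Assumption: age_group is left out when no age is predicted.
        attrs["age_group"] = age_group
    attrs["gender"] = gender
    for name in TRAITS:
        # Keep scores inside the range used by the task (-0.5 to +0.5).
        score = max(-0.5, min(0.5, traits[name]))
        attrs[name] = f"{score:.2f}"

    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{author_id}.xml")
    ET.ElementTree(ET.Element("author", attrs)).write(path, encoding="utf-8")
    return path
```

A call such as `write_prediction("output", "user1", "en", "female", {t: 0.1 for t in TRAITS}, age_group="25-34")` would create `output/user1.xml`, following the recommended naming scheme.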
Evaluation
The performance of your author profiling solution for age and gender will be ranked by accuracy.
For personality identification the average Root Mean Squared Error (RMSE) will be used.
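To make the personality measure concrete, here is a small Python sketch of an average RMSE. It assumes that RMSE is computed per trait over all authors and then averaged over the traits; the exact averaging used by the organizers is not spelled out here, and the function names are illustrative only.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long score lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def average_rmse(predicted_by_trait, actual_by_trait):
    """Mean of the per-trait RMSE values (assumed averaging scheme).

    Both arguments map a trait name to a list of scores, one per author.
    """
    per_trait = [
        rmse(predicted_by_trait[trait], actual_by_trait[trait])
        for trait in actual_by_trait
    ]
    return sum(per_trait) / len(per_trait)
```

Because the scores lie between -0.5 and +0.5, the RMSE under this scheme cannot exceed 1.0, which is consistent with using 1 - RMSE as a score.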
To obtain a global ranking, we apply the following formula: global_ranking = ((1 - RMSE) + joint_accuracy) / 2
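The formula can be sketched in Python as follows. The sketch assumes that joint accuracy is the fraction of authors for whom every predicted label (for example, both age group and gender) is correct; that reading, the function names, and the sample values are assumptions for illustration, not the official evaluation code.

```python
def joint_accuracy(predicted, actual):
    """Fraction of authors whose full label tuple is predicted correctly.

    Each item is a tuple such as (age_group, gender); a prediction only
    counts as correct if the whole tuple matches (assumed definition).
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def global_ranking(rmse, joint_acc):
    """Combine the two measures as ((1 - RMSE) + joint_accuracy) / 2."""
    return ((1.0 - rmse) + joint_acc) / 2.0


# Made-up example: three of four authors are fully correct.
predicted = [("18-24", "male"), ("25-34", "female"),
             ("35-49", "female"), ("50-xx", "male")]
actual = [("18-24", "male"), ("25-34", "male"),
          ("35-49", "female"), ("50-xx", "male")]
acc = joint_accuracy(predicted, actual)   # 0.75
score = global_ranking(0.15, acc)         # approximately 0.8
```

Both terms lie between 0 and 1 and higher is better for each, so the global score can be read like an accuracy.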
Results
The following table lists the performances achieved by the participating teams:
Author profiling performance (avg. accuracy) | Team |
---|---|
0.8404 | Miguel Ángel Álvarez Carmona, Adrián Pastor López Monroy, Manuel Montes y Gómez, Luis Villaseñor Pineda and Hugo Jair Escalante. INAOE Mexico. |
0.8346 | Carlos E. González Gallardo, Azucena Montes Redón, Gerardo Eugenio Sierra Martínez, José Antonio Nuñez Juárez, Adolfo Jonathan Salinas López and Juan Rodrigo Ek Catzin. UNAM Mexico. |
0.8078 | Andreas Grivas, Anastasia Krithara and George Giannakopoulos. NCSR Demokritos, Greece. |
0.7875 | Mirco Kocher. University of Neuchâtel, Switzerland. |
0.7755 | Octavia Maria Sulea and Daniel Dichiu. Bitdefender and University of Bucharest, Romania. |
0.7584 | Lesly Miculicich. University of Neuchâtel, Switzerland. |
0.7338 | Scot Nowson, Julien Perez, Caroline Brun, Shachar Mirkin and Claude Roux. Xerox Research Centre Europe, France. |
0.7223 | Edson Roberto Duarte Weren. Brazil. |
0.7130 | Adam Poulston, Mark Stevenson and Kalina Bontcheva. University of Sheffield, United Kingdom. |
0.7061 | Suraj Maharjan and Thamar Solorio. University of Houston, United States. |
0.6960 | Caitlin McCollister, Bo Lou and Shu Huang. University of Kansas, United States. |
0.6875 | Mounica Arroju, Aftab Hassan and Golnoosh Farnadi. University of Washington Tacoma, United States. |
0.6857 | Mayte Gimenez, Delia Irazú Hernández and Ferran Plá. Universitat Politècnica de València, Spain. |
0.6809 | Alberto Bartoli, Andrea De Lorenzo, Alessandra Laderchi, Eric Medvet and Fabiano Tarlao. University of Trieste, Italy. |
0.6685 | Ifrah Pervaz, Iqra Ameer, Abdul Sittar, Rao Muhammad Adeel Nawab. COMSATS Institute of Information Technology, Pakistan. |
0.6495 | Fahad Najib, Waqas Arshad Cheema and Rao Muhammad Adeel Nawab. COMSATS Lahore, Pakistan. |
0.6401 | Piotr Przybyla and Pawel Teisseyre. Polish Academy of Sciences, Poland. |
0.6204 | Alonso Palomino Garibay, Adolfo T. Camacho González, Ricardo A. Fierro Villaneda, Irazú Hernández Farias, Davide Buscaldi and Ivan Vladimir Meza Ruiz. UNAM, Mexico. |
0.6178 | Roy Bayot, Teresa Gonçalves and Paolo Quaresma. Universidade de Évora, Portugal. |
* | Hafiz Rizwan Iqbal, Muhammad Adnan Ashraf and Rao Muhammad Adeel Nawab. Pakistan. |
* | Yasen Kiprov, Momchil Hardalov, Preslav Nakov and Ivan Koychev. Sofia University "St. Kliment Ohridski", Bulgaria. |
* | Juan Pablo Posadas Durán, Ilia Markov, Helena Gómez Adorno, Grigori Sidorov, Ildar Batyrshin, Alexander Gelbukh and Obdulia Pichardo Lagunas. National Polytechnic Institute, Mexico. |
* Results have been omitted for these teams since they participated in some languages only.
A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.
Related Work
- Fabio Celli, Bruno Lepri, Joan-Isaac Biel, Daniel Gatica-Perez, Giuseppe Riccardi, Fabio Pianesi. The Workshop on Computational Personality Recognition 2014. Proceedings of ACM Multimedia 2014: 1245-1246.
- Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 Labs and Workshops. CEUR-WS.org, vol. 1179.
- Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-927.
- S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52 (2): 119–123.
- J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.
- M. Koppel, S. Argamon and A. Shimoni (2002), Automatically categorizing written texts by author gender, Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.
- J. Pennebaker (2011), The secret life of pronouns: What our words say about us. New York: Bloomsbury Publishing.
- Workshop on Computational Personality Recognition 2013
- Workshop on Computational Personality Recognition 2014
- PAN-AP-13 corpus - Author Profiling Shared Task
- PAN-AP-14 corpus - Author Profiling Shared Task
- The Blog Authorship Corpus
- Mypersonality dataset
- Essays dataset
- Mobile dataset
- Youtube dataset (request)