Author Profiling 2018

Synopsis

Task: Given the text and images of a Twitter feed, identify the author's gender.
Input: [data]
Submission: [submit]

Introduction

Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Task

This year the focus will be on gender identification in Twitter, where text and images may be used as information sources. The languages addressed will be:

English
Spanish
Arabic

Although we encourage participation in all languages, it is possible to participate in only some of them. Similarly, although we encourage participation in both textual and image classification, it is possible to participate in only one of them.

Award

We are happy to announce that the best performing team at the 6th International Competition on Author Profiling will be awarded 300 euros, sponsored by MeaningCloud.

Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki Taniguchi, and Tomoko Ohkuma. Fuji Xerox Co., Ltd.

Congratulations!

Data

To develop your software, we provide you with a training data set that consists of Twitter users labeled with gender. For each author, a total of 100 tweets and 10 images are provided. Authors are grouped by the language of their tweets: English, Arabic and Spanish.

Your software must take as input the absolute path to an unpacked dataset and must output, for each document of the dataset, a corresponding XML file that looks like this:

<author id="author-id"
    lang="en|es|ar"
    gender_txt="male|female"
    gender_img="male|female"
    gender_comb="male|female" />

We ask you to provide three different predictions for the author's gender, depending on your approach:

gender_txt: gender prediction by using only text
gender_img: gender prediction by using only images
gender_comb: gender prediction by using both text and images

As noted above, you can participate in both textual and image classification, or in only one of them. Hence, if your approach uses only textual features, your prediction should be given in gender_txt. Similarly, if your approach relies only on images, your prediction should be given in gender_img. If you use both text and images, your prediction should be given in gender_comb. Furthermore, in that case, if you can also provide the predictions obtained with each approach separately, this would allow us to perform a more in-depth analysis of the results and to compare textual vs. image-based author profiling. You should then provide all three predictions for the same author: gender_txt, gender_img and gender_comb.

The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.

IMPORTANT! Languages should not be mixed. Create one folder per language and place in it only the prediction files for that language.
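
As an illustration of the output format described above, the following minimal Python sketch writes one prediction file per author into a per-language folder. The function name write_prediction and its parameters are hypothetical, and leaving out the attributes for modalities you did not predict is an assumption of this sketch, not a stated requirement.

```python
import os
import xml.etree.ElementTree as ET

LANGS = {"en", "es", "ar"}
GENDERS = {"male", "female"}


def write_prediction(out_dir, author_id, lang,
                     gender_txt=None, gender_img=None, gender_comb=None):
    """Write <out_dir>/<lang>/<author_id>.xml and return its path.

    Hypothetical helper: attributes whose value is None are left out
    (an assumption of this sketch).
    """
    if lang not in LANGS:
        raise ValueError(f"unexpected language code: {lang!r}")

    attrs = {"id": author_id, "lang": lang}
    predictions = (("gender_txt", gender_txt),
                   ("gender_img", gender_img),
                   ("gender_comb", gender_comb))
    for name, value in predictions:
        if value is None:
            continue
        if value not in GENDERS:
            raise ValueError(f"unexpected value for {name}: {value!r}")
        attrs[name] = value

    # One folder per language, so that languages are never mixed.
    lang_dir = os.path.join(out_dir, lang)
    os.makedirs(lang_dir, exist_ok=True)

    # The author-id is used as filename, "xml" as extension.
    path = os.path.join(lang_dir, f"{author_id}.xml")
    ET.ElementTree(ET.Element("author", attrs)).write(path, encoding="utf-8")
    return path
```

A text-only approach would call, for example, write_prediction("output", "author-id", "en", gender_txt="female"), while a combined approach would pass all three predictions.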

Evaluation

The performance of your author profiling solution will be ranked by accuracy. For each language, we will calculate individual accuracies. Then, we will average the accuracy values per language to obtain the final ranking.
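
The ranking measure described above can be sketched in a few lines of Python. This is only an illustration, not the official evaluation script: the function names and the dictionary-based inputs are hypothetical, and counting an author without a prediction as an error is an assumption.

```python
def accuracy(gold, predicted):
    """Fraction of authors whose predicted gender equals the gold label.

    gold and predicted map author ids to "male" or "female".
    Assumption: an author missing from predicted counts as wrong.
    """
    if not gold:
        raise ValueError("gold labels must not be empty")
    correct = sum(1 for author, label in gold.items()
                  if predicted.get(author) == label)
    return correct / len(gold)


def final_score(accuracy_per_language):
    """Unweighted mean of the per-language accuracies."""
    if not accuracy_per_language:
        raise ValueError("at least one language is required")
    return sum(accuracy_per_language.values()) / len(accuracy_per_language)
```

For example, per-language accuracies of 0.7850 (Arabic), 0.8584 (English) and 0.8159 (Spanish) average to 0.8198 when rounded to four decimals.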

Submission

We ask you to prepare your software so that it can be executed via command line calls. More details will be released here soon.

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Results

The following tables list the performances achieved by the participating teams in the different subtasks:

GLOBAL RANKING
RANK	TEAM	ARABIC	ENGLISH	SPANISH	AVERAGE
1	takahashi18	0.7850	0.8584	0.8159	0.8198
2	daneshvar18	0.8090	0.8221	0.8200	0.8170
3	miranda18	0.8180	0.8068	0.7955	0.8068
4	laporte18	0.7940	0.8132	0.8000	0.8024
5	schuur18	0.7920	0.8074	0.7918	0.7971
6	vaneerden18 --> nieuwenhuis18	0.7870	0.8095	0.7923	0.7963
7	sierraloaiza18	0.8100	0.8063	0.7477	0.7880
8	martinc18	0.7780	0.7926	0.7786	0.7831
9	snijders18	0.7490	0.7926	0.8036	0.7817
10	lopezsantillan18	0.7760	0.7847	0.7677	0.7761
11	miller18	0.7570	0.7947	0.7623	0.7713
12	gouravdas18	0.7680	0.7737	0.7709	0.7709
13	yigal18	0.7570	0.7889	0.7591	0.7683
14	pool18	0.7640	0.7884	0.7432	0.7652
15	vondaeniken18	0.7320	0.7742	0.7464	0.7509
16	schaetti18	0.7390	0.7711	0.7359	0.7487
17	aragonsaenzpardo18	0.6670	0.8016	0.7723	0.7470
18	bayot18	0.6760	0.7716	0.6873	0.7116
19	gariboiorts18	0.6750	0.7363	0.7164	0.7092
20	tekir18	0.6920	0.7495	0.6655	0.7023
21	raiyani18	0.7220	0.7279	0.6436	0.6978
22	sandroni18	0.6870	0.6658	0.6782	0.6770
23	karlgren18	-	0.5521	-	-

ARABIC RANKING
RANK	TEAM	TEXT	IMAGES	COMBINED
1	miranda18	0.8170	0.5900	0.8180
2	sierraloaiza18	0.8120	0.7280	0.8100
3	daneshvar18	0.8090	-	0.8090
4	laporte18	0.7910	0.7010	0.7940
5	schuur18	0.7920	-	0.7920
6	vaneerden18 --> nieuwenhuis18	0.7830	0.6230	0.7870
7	takahashi18	0.7710	0.7720	0.7850
8	martinc18	0.7760	0.5600	0.7780
9	lopezsantillan18	0.7760	-	0.7760
10	gouravdas18	0.7430	0.6570	0.7680
11	pool18	0.7600	0.6230	0.7640
12	miller18	0.7590	0.4970	0.7570
13	yigal18	0.7590	0.5100	0.7570
14	snijders18	0.7490	-	0.7490
15	schaetti18	0.7390	0.5430	0.7390
16	vondaeniken18	0.7320	-	0.7320
17	raiyani18	0.7220	-	0.7220
18	tekir18	0.6920	-	0.6920
19	sandroni18	0.6870	-	0.6870
20	bayot18	0.6760	-	0.6760
21	gariboiorts18	0.6750	-	0.6750
22	aragonsaenzpardo18	0.6480	0.6800	0.6670
23	karlgren18	-	-	-

ENGLISH RANKING
RANK	TEAM	TEXT	IMAGES	COMBINED
1	takahashi18	0.7968	0.8163	0.8584
2	daneshvar18	0.8221	-	0.8221
3	laporte18	0.8074	0.6963	0.8132
4	vaneerden18 --> nieuwenhuis18	0.8116	0.6100	0.8095
5	schuur18	0.8074	-	0.8074
6	miranda18	0.8121	0.5468	0.8068
7	sierraloaiza18	0.8011	0.7442	0.8063
8	aragonsaenzpardo18	0.7963	0.6921	0.8016
9	miller18	0.7911	0.5174	0.7947
10	snijders18	0.7926	-	0.7926
11	martinc18	0.7900	0.5826	0.7926
12	yigal18	0.7911	0.4942	0.7889
13	pool18	0.7853	0.6584	0.7884
14	lopezsantillan18	0.7847	-	0.7847
15	vondaeniken18	0.7742	-	0.7742
16	gouravdas18	0.7558	0.6747	0.7737
17	bayot18	0.7716	-	0.7716
18	schaetti18	0.7711	0.5763	0.7711
19	tekir18	0.7495	-	0.7495
20	gariboiorts18	0.7363	-	0.7363
21	raiyani18	0.7279	-	0.7279
22	sandroni18	0.6658	-	0.6658
23	karlgren18	0.5521	-	0.5521

SPANISH RANKING
RANK	TEAM	TEXT	IMAGES	COMBINED
1	daneshvar18	0.8200	-	0.8200
2	takahashi18	0.7864	0.7732	0.8159
3	snijders18	0.8036	-	0.8036
4	laporte18	0.7959	0.6805	0.8000
5	miranda18	0.8005	0.5691	0.7955
6	vaneerden18 --> nieuwenhuis18	0.8027	0.5873	0.7923
7	schuur18	0.7918	-	0.7918
8	martinc18	0.7782	0.5486	0.7786
9	aragonsaenzpardo18	0.7686	0.6668	0.7723
10	gouravdas18	0.7586	0.6918	0.7709
11	lopezsantillan18	0.7677	-	0.7677
12	miller18	0.7650	0.4923	0.7623
13	yigal18	0.7650	0.5027	0.7591
14	sierraloaiza18	0.7827	0.7100	0.7477
15	vondaeniken18	0.7464	-	0.7464
16	pool18	0.7405	0.6232	0.7432
17	schaetti18	0.7359	0.5782	0.7359
18	gariboiorts18	0.7164	-	0.7164
19	bayot18	0.6873	-	0.6873
20	sandroni18	0.6782	-	0.6782
21	tekir18	0.6655	-	0.6655
22	raiyani18	0.6436	-	0.6436
23	karlgren18	-	-	-

Related Work

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato L., Ferro N., Goeuriot L., Mandl T. (Eds.) CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1866.
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Cappellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784.
Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: Linda Cappellato and Nicola Ferro and Gareth Jones and Eric San Juan (Eds.): CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/, 2015.
Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-927.
Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 LABs and Workshops. CEUR-WS.org, vol. 1179.
Francisco Rangel, Paolo Rosso. On the Impact of Emotions on Author Profiling. In: Information Processing & Management, vol. 52, issue 1, pp. 73-92.
S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52 (2): 119–123.
J. Pennebaker (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Publishing, 2011.
PAN-AP-17 corpus - Author Profiling Shared Task
PAN-AP-16 corpus - Author Profiling Shared Task
PAN-AP-15 corpus - Author Profiling Shared Task
PAN-AP-14 corpus - Author Profiling Shared Task
PAN-AP-13 corpus - Author Profiling Shared Task
The Blog Authorship Corpus

Task Committee

Francisco Rangel

Paolo Rosso