Author Profiling 2017

Synopsis

Task: Given the text of a Twitter feed, determine the author's gender and language variety.
Input: [data]
Submission: [submit]

Introduction

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in applications in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Task

Gender and language variety identification in Twitter. Demographic traits such as gender and language variety have so far been investigated separately. In this task we will provide participants with a Twitter corpus annotated with the authors' gender and the specific variety of their native language:

English (Australia, Canada, Great Britain, Ireland, New Zealand, United States)
Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)
Portuguese (Brazil, Portugal)
Arabic (Egypt, Gulf, Levantine, Maghrebi)

Although we suggest participating in both subtasks (gender and language variety identification) and in all languages, it is possible to participate in only one subtask and in only some of the languages.

Award

We are happy to announce that the best-performing team at the 5th International Competition on Author Profiling will be awarded 300 euros, sponsored by MeaningCloud.

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. University of Groningen, Netherlands.

Congratulations!

Data

To develop your software, we provide you with a training data set that consists of tweets in English, Spanish, Portuguese, and Arabic, labeled with gender and language variety.

Download corpus (Updated March 10, 2017)

Info about additional training material (although domains are different): http://ttg.uni-saarland.de/resources/DSLCC

Test Corpus

Download test corpus + truth files (Updated March 16, 2017)

Output

Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:

  <author id="author-id"
          lang="en|es|pt|ar"
          variety="australia|canada|great britain|ireland|new zealand|united states|
                   argentina|chile|colombia|mexico|peru|spain|venezuela|
                   portugal|brazil|
                   gulf|levantine|maghrebi|egypt"
          gender="male|female"
  />

The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.

IMPORTANT! Languages should not be mixed. Create one folder per language and place in it only the prediction files for that language.
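
The output convention described above could be implemented along the following lines. This is a minimal sketch, not official task software: the function name write_prediction and the folder layout <output_dir>/<lang>/<author-id>.xml are assumptions that follow the recommendations in this section.

```python
import os
import xml.etree.ElementTree as ET


def write_prediction(output_dir, author_id, lang, variety, gender):
    """Write one prediction to <output_dir>/<lang>/<author_id>.xml.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # One folder per language, so that predictions are never mixed.
    lang_dir = os.path.join(output_dir, lang)
    os.makedirs(lang_dir, exist_ok=True)

    # Single <author .../> element carrying the four required attributes.
    author = ET.Element(
        "author",
        {"id": author_id, "lang": lang, "variety": variety, "gender": gender},
    )

    # Recommended naming: author-id as filename, "xml" as extension.
    path = os.path.join(lang_dir, author_id + ".xml")
    ET.ElementTree(author).write(path, encoding="utf-8")
    return path
```

Calling write_prediction("out", "abc123", "en", "ireland", "female") would create the file out/en/abc123.xml containing a single author element.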

Evaluation

The performance of your author profiling solution will be ranked by accuracy.

For each language, we will calculate individual accuracies for gender and variety identification. Then, we will calculate the accuracy when BOTH variety and gender are properly predicted together. Finally, we will average the accuracy values per language to obtain the final ranking.
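
The measures described above can be sketched as follows. This is an illustration under stated assumptions, not the official evaluation script: truth and predictions are assumed to be dictionaries mapping an author id to a (gender, variety) pair, an author without a prediction is assumed to count as wrong, and all function names are hypothetical.

```python
def accuracy(hits):
    """Fraction of True values in a list of booleans (0.0 for an empty list)."""
    return sum(hits) / len(hits) if hits else 0.0


def evaluate_language(truth, predicted):
    """Return gender, variety, and joint accuracy for one language.

    truth, predicted: dict mapping author id -> (gender, variety).
    """
    gender_hits, variety_hits, joint_hits = [], [], []
    for author_id, (true_gender, true_variety) in truth.items():
        # Assumption: a missing prediction counts as an error.
        pred_gender, pred_variety = predicted.get(author_id, (None, None))
        gender_ok = pred_gender == true_gender
        variety_ok = pred_variety == true_variety
        gender_hits.append(gender_ok)
        variety_hits.append(variety_ok)
        # Joint accuracy: BOTH traits must be right for the same author.
        joint_hits.append(gender_ok and variety_ok)
    return {
        "gender": accuracy(gender_hits),
        "variety": accuracy(variety_hits),
        "joint": accuracy(joint_hits),
    }


def average_over_languages(per_language, measure):
    """Average one measure over the per-language score dictionaries."""
    values = [scores[measure] for scores in per_language.values()]
    return sum(values) / len(values)
```

Note that the joint accuracy can never exceed the smaller of the two individual accuracies, since a joint hit requires both predictions to be correct.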

Results

The following tables list the performance achieved by the participating teams in the different subtasks.

For comparison, we provide three baselines:

LDR-baseline: Described in "A Low Dimensionality Representation for Language Variety Identification". In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016, Springer-Verlag, LNCS(), arXiv:1705.10754
BOW-baseline: A common bag-of-words representation using the 1000 most frequent words
STAT-baseline: A statistical baseline (majority class or random choice)
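
As an illustration of the bag-of-words representation mentioned above, the following sketch builds a vocabulary of the most frequent words and turns a text into a count vector. The tokenization (lowercasing and splitting on whitespace) and the function names are assumptions; the page does not state how the baseline tokenizes the tweets or which classifier is trained on the vectors.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(documents, size=1000):
    """Return the `size` most frequent words over all training documents."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    return [word for word, _ in counts.most_common(size)]


def vectorize(document, vocabulary):
    """Represent one document as word counts over the fixed vocabulary."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]
```

Each author's tweets would be concatenated or vectorized together, and the resulting vectors passed to any standard classifier.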

GLOBAL RANKING
RANK	TEAM	GENDER	VARIETY	JOINT	AVERAGE
1	Basile et al.	0.8253	0.9184	0.8361	0.8599
2	Martinc et al.	0.8224	0.9085	0.8285	0.8531
3	Tellez et al.	0.8097	0.9171	0.8258	0.8509
4	Miura et al.	0.8127	0.8982	0.8162	0.8424
5	López-Monroy et al.	0.8047	0.8986	0.8111	0.8381
6	Markov et al.	0.7957	0.9056	0.8097	0.8370
7	Poulston et al.	0.7974	0.8786	0.7942	0.8234
8	Sierra et al.	0.7641	0.8911	0.7822	0.8125
	LDR-baseline	0.7325	0.9187	0.7750	0.8087
9	Ogaltsov & Romanov	0.7669	0.8591	0.7653	0.7971
10	Franco-Salvador et al.	0.7667	0.8508	0.7582	0.7919
11	Schaetti	0.7207	0.8864	0.7511	0.7861
12	Kodiyan et al.	0.7531	0.8522	0.7509	0.7854
13	Ciobanu et al.	0.7504	0.8524	0.7498	0.7842
14	Kheng et al.	0.7002	0.8513	0.7176	0.7564
15	Ganesh*	0.7342	0.7626	0.6881	0.7283
16	Kocher & Savoy	0.7178	0.7661	0.6813	0.7217
17	Ignatov et al.	0.6917	0.7024	0.6270	0.6737
	BOW-baseline	0.6763	0.6907	0.6195	0.6622
18	Khan	0.6252	0.5296	0.4952	0.5500
19	Ribeiro-Oliveira et al.	0.3666	0.4141	0.3092	0.3633
	STAT-baseline	0.5000	0.2649	0.2991	0.3547
20	Alrifai et al.	0.1806	0.1888	0.1701	0.1798
21	Bouzazi*	0.1530	0.0931	0.1027	0.1163
22	Adame et al.	0.1353	0.0476	0.0695	0.0841

VARIETY RANKING
RANK	TEAM	ARABIC	ENGLISH	PORTUGUESE	SPANISH	AVERAGE
	LDR-baseline	0.8250	0.8996	0.9875	0.9625	0.9187
1	Basile et al.	0.8313	0.8988	0.9813	0.9621	0.9184
2	Tellez et al.	0.8275	0.9004	0.9850	0.9554	0.9171
3	Martinc et al.	0.8288	0.8688	0.9838	0.9525	0.9085
4	Markov et al.	0.8169	0.8767	0.9850	0.9439	0.9056
5	López-Monroy et al.	0.8119	0.8567	0.9825	0.9432	0.8986
6	Miura et al.	0.8125	0.8717	0.9813	0.9271	0.8982
7	Sierra et al.	0.7950	0.8392	0.9850	0.9450	0.8911
8	Schaetti	0.8131	0.8150	0.9838	0.9336	0.8864
9	Poulston et al.	0.7975	0.8038	0.9763	0.9368	0.8786
10	Ogaltsov & Romanov	0.7556	0.8092	0.9725	0.8989	0.8591
11	Ciobanu et al.	0.7569	0.7746	0.9788	0.8993	0.8524
12	Kodiyan et al.	0.7688	0.7908	0.9350	0.9143	0.8522
13	Kheng et al.	0.7544	0.7588	0.9750	0.9168	0.8513
14	Franco-Salvador et al.	0.7656	0.7588	0.9788	0.9000	0.8508
15	Kocher & Savoy	0.7188	0.6521	0.9725	0.7211	0.7661
16	Ganesh*	0.7144	0.6021	0.9650	0.7689	0.7626
17	Ignatov et al.	0.4488	0.5813	0.9763	0.8032	0.7024
	BOW-baseline	0.3394	0.6592	0.9712	0.7929	0.6907
18	Khan	0.5844	0.2779	0.9063	0.3496	0.5296
19	Ribeiro-Oliveira et al.	0.6713		0.9850		0.4141
	STAT-baseline	0.2500	0.1667	0.5000	0.1429	0.2649
20	Alrifai et al.	0.7550				0.1888
21	Bouzazi*		0.3725			0.0931
22	Adame et al.		0.1904			0.0476
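
The STAT-baseline row of the variety ranking is consistent with chance-level accuracy, that is, one divided by the number of varieties per language. The short check below reproduces the reported values; it is an inference from the numbers in the table and from the variety lists in the Task section, not a description of how the baseline was actually run.

```python
# Number of varieties per language, counted from the Task section.
VARIETIES = {"Arabic": 4, "English": 6, "Portuguese": 2, "Spanish": 7}


def chance_accuracy(num_classes):
    """Expected accuracy of a uniform random guess over balanced classes."""
    return 1.0 / num_classes


variety_chance = {lang: chance_accuracy(n) for lang, n in VARIETIES.items()}
variety_average = sum(variety_chance.values()) / len(variety_chance)

for lang, value in variety_chance.items():
    print(f"{lang}: {value:.4f}")
print(f"Average: {variety_average:.4f}")
# Prints 0.2500, 0.1667, 0.5000, 0.1429 and an average of 0.2649,
# matching the STAT-baseline row.
```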

GENDER RANKING
RANK	TEAM	ARABIC	ENGLISH	PORTUGUESE	SPANISH	AVERAGE
1	Basile et al.	0.8006	0.8233	0.8450	0.8321	0.8253
2	Martinc et al.	0.8031	0.8071	0.8600	0.8193	0.8224
3	Miura et al.	0.7644	0.8046	0.8700	0.8118	0.8127
4	Tellez et al.	0.7838	0.8054	0.8538	0.7957	0.8097
5	López-Monroy et al.	0.7763	0.8171	0.8238	0.8014	0.8047
6	Poulston et al.	0.7738	0.7829	0.8388	0.7939	0.7974
7	Markov et al.	0.7719	0.8133	0.7863	0.8114	0.7957
8	Ogaltsov & Romanov	0.7213	0.7875	0.7988	0.7600	0.7669
9	Franco-Salvador et al.	0.7300	0.7958	0.7688	0.7721	0.7667
10	Sierra et al.	0.6819	0.7821	0.8225	0.7700	0.7641
11	Kodiyan et al.	0.7150	0.7888	0.7813	0.7271	0.7531
12	Ciobanu et al.	0.7131	0.7642	0.7713	0.7529	0.7504
13	Ganesh*	0.6794	0.7829	0.7538	0.7207	0.7342
	LDR-baseline	0.7044	0.7220	0.7863	0.7171	0.7325
14	Schaetti	0.6769	0.7483	0.7425	0.7150	0.7207
15	Kocher & Savoy	0.6913	0.7163	0.7788	0.6846	0.7178
16	Kheng et al.	0.6856	0.7546	0.6638	0.6968	0.7002
17	Ignatov et al.	0.6425	0.7446	0.6850	0.6946	0.6917
	BOW-baseline	0.5300	0.7075	0.7812	0.6864	0.6763
18	Khan	0.5863	0.6692	0.6100	0.6354	0.6252
	STAT-baseline	0.5000	0.5000	0.5000	0.5000	0.5000
19	Ribeiro-Oliveira et al.	0.7013		0.7650		0.3666
20	Alrifai et al.	0.7225				0.1806
21	Bouzazi*		0.6121			0.1530
22	Adame et al.		0.5413			0.1353

JOINT RANKING
RANK	TEAM	ARABIC	ENGLISH	PORTUGUESE	SPANISH	AVERAGE
1	Basile et al.	0.6831	0.7429	0.8288	0.8036	0.7646
2	Martinc et al.	0.6825	0.7042	0.8463	0.7850	0.7545
3	Tellez et al.	0.6713	0.7267	0.8425	0.7621	0.7507
4	Miura et al.	0.6419	0.6992	0.8575	0.7518	0.7376
5	López-Monroy et al.	0.6475	0.7029	0.8100	0.7604	0.7302
6	Markov et al.	0.6525	0.7125	0.7750	0.7704	0.7276
7	Poulston et al.	0.6356	0.6254	0.8188	0.7471	0.7067
8	Sierra et al.	0.5694	0.6567	0.8113	0.7279	0.6913
	LDR-baseline	0.5888	0.6357	0.7763	0.6943	0.6738
9	Ogaltsov & Romanov	0.5731	0.6450	0.7775	0.6846	0.6701
10	Franco-Salvador et al.	0.5688	0.6046	0.7525	0.7021	0.6570
11	Kodiyan et al.	0.5688	0.6263	0.7300	0.6646	0.6474
12	Ciobanu et al.	0.5619	0.5904	0.7575	0.6764	0.6466
13	Schaetti	0.5681	0.6150	0.7300	0.6718	0.6462
14	Kheng et al.	0.5475	0.5704	0.6475	0.6400	0.6014
15	Ganesh*	0.5075	0.4713	0.7300	0.5614	0.5676
16	Kocher & Savoy	0.5206	0.4650	0.7575	0.4971	0.5601
	BOW-baseline	0.1794	0.4713	0.7588	0.5561	0.4914
17	Ignatov et al.	0.2875	0.4333	0.6675	0.5593	0.4869
18	Khan	0.3650	0.1900	0.5488	0.2189	0.3307
19	Ribeiro-Oliveira et al.	0.4831		0.7538		0.3092
20	Alrifai et al.	0.5638				0.1410
	STAT-baseline	0.1250	0.0833	0.2500	0.0714	0.1324
21	Bouzazi*		0.2479			0.0620
22	Adame et al.		0.1017			0.0254
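
The AVERAGE column in the per-language tables is consistent with a mean over all four languages in which a language without a submission counts as zero. The following sketch checks this against two rows of the joint ranking; the function name is hypothetical and the zero-for-missing rule is inferred from the reported numbers.

```python
LANGUAGES = ("Arabic", "English", "Portuguese", "Spanish")


def average_accuracy(scores):
    """Mean over all four languages; a missing language contributes 0."""
    return sum(scores.get(lang, 0.0) for lang in LANGUAGES) / len(LANGUAGES)


# Joint accuracies taken from the table above.
ribeiro_oliveira = {"Arabic": 0.4831, "Portuguese": 0.7538}
bouzazi = {"English": 0.2479}

print(f"{average_accuracy(ribeiro_oliveira):.4f}")  # 0.3092
print(f"{average_accuracy(bouzazi):.4f}")  # 0.0620
```

This explains why teams that submitted for only one or two languages rank near the bottom despite competitive per-language scores.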

Related Work

Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, Benno Stein. Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations. In: Balog K., Cappellato L., Ferro N., Macdonald C. (Eds.) CLEF 2016 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1609, pp. 750-784.
Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, Walter Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In: Cappellato L., Ferro N., Jones G., San Juan E. (Eds.) CLEF 2015 Labs and Workshops, Notebook Papers, 8-11 September, Toulouse, France. CEUR Workshop Proceedings. ISSN 1613-0073, http://ceur-ws.org/Vol-1391/, 2015.
Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, Walter Daelemans. Overview of the 2nd Author Profiling Task at PAN 2014. In: Cappellato L., Ferro N., Halvey M., Kraaij W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, vol. 1180, pp. 898-927.
Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, Giacomo Inches. Overview of the Author Profiling Task at PAN 2013. In: Forner P., Navigli R., Tufis D. (Eds.) Notebook Papers of CLEF 2013 Labs and Workshops. CEUR-WS.org, vol. 1179.
Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli. Findings of the VarDial Evaluation Campaign 2017. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). pp. 1-15. Valencia, Spain.
Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann. Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). pp. 1-14. Osaka, Japan.
Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Preslav Nakov. Overview of the DSL Shared Task 2015. Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial). pp. 1-9. Hissar, Bulgaria.
Francisco Rangel, Paolo Rosso. On the Impact of Emotions on Author Profiling. In: Information Processing & Management, vol. 52, issue 1, pp. 73-92
Francisco Rangel, Marc Franco-Salvador, Paolo Rosso. A Low Dimensionality Representation for Language Variety Identification. In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016, Springer-Verlag, LNCS(), arXiv:1705.10754
S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically profiling the author of an anonymous text, Communications of the ACM 52 (2): 119–123.
J. Pennebaker (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Publishing, 2011.
PAN-AP-16 corpus - Author Profiling Shared Task
PAN-AP-15 corpus - Author Profiling Shared Task
PAN-AP-14 corpus - Author Profiling Shared Task
PAN-AP-13 corpus - Author Profiling Shared Task
The Blog Authorship Corpus
DSL Corpus Collection

Task Committee

Paolo Rosso

Francisco Rangel