Synopsis

  • Task: Given the text of a Twitter feed, identify the author's gender and language variety.
  • Input: [data]
  • Submission: [submit]

Introduction

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in applications in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to infer the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Task

Gender and language variety identification in Twitter. Demographic traits such as gender and language variety have so far been investigated separately. In this task we provide participants with a Twitter corpus annotated with the authors' gender and the specific variety of their native language:

  • English (Australia, Canada, Great Britain, Ireland, New Zealand, United States)
  • Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)
  • Portuguese (Brazil, Portugal)
  • Arabic (Egypt, Gulf, Levantine, Maghrebi)    

Although we encourage participation in both subtasks (gender and language variety identification) and in all languages, it is possible to participate in only one of the subtasks and in a subset of the languages.

Award

We are happy to announce that the best performing team at the 5th International Competition on Author Profiling will be awarded 300,- Euro, sponsored by MeaningCloud. The winning team is:

  • Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. University of Groningen, Netherlands.

Congratulations!

Data

To develop your software, we provide you with a training data set consisting of tweets in English, Spanish, Portuguese, and Arabic, labeled with gender and language variety.

Download corpus (Updated March 10, 2017)

Info about additional training material (note that the domains are different): http://ttg.uni-saarland.de/resources/DSLCC

Test Corpus

Download test corpus + truth files (Updated March 16, 2017)

Output

Your software must take as input the absolute path to an unpacked dataset and output, for each document (author) of the dataset, a corresponding XML file that looks like this:

  <author id="author-id"
          lang="en|es|pt|ar"
          variety="australia|canada|great britain|ireland|new zealand|united states|
                   argentina|chile|colombia|mexico|peru|spain|venezuela|
                   portugal|brazil|
                   gulf|levantine|maghrebi|egypt"
          gender="male|female"
  />
  

The naming of the output files is up to you; we recommend using the author id as the filename and "xml" as the extension.

IMPORTANT! Languages must not be mixed. Create a separate folder for each language and place inside it only the prediction files for that language.
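
For illustration only, here is a minimal Python sketch of writing one such prediction file into a per-language folder. The function name, arguments, and paths are hypothetical; this is not the official submission software.

  import os
  from xml.sax.saxutils import quoteattr

  def write_prediction(out_dir, lang, author_id, variety, gender):
      # Write one <author .../> file into out_dir/<lang>/<author_id>.xml
      lang_dir = os.path.join(out_dir, lang)   # one folder per language
      os.makedirs(lang_dir, exist_ok=True)
      xml = '<author id={} lang={} variety={} gender={} />\n'.format(
          quoteattr(author_id), quoteattr(lang),
          quoteattr(variety), quoteattr(gender))
      with open(os.path.join(lang_dir, author_id + '.xml'), 'w', encoding='utf-8') as f:
          f.write(xml)

  # Hypothetical example call:
  write_prediction('/path/to/output', 'en', 'author-id', 'canada', 'female')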

Evaluation

The performance of your author profiling solution will be ranked by accuracy.

For each language, we will calculate individual accuracies for gender and variety identification, as well as the joint accuracy, i.e., the proportion of authors for whom BOTH variety and gender are predicted correctly. Finally, we will average the per-language accuracy values to obtain the final ranking.
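
As an illustration of this scheme, the following Python sketch computes the three per-language accuracies from hypothetical gold and predicted label lists; it reflects our reading of the description above, not the official evaluation script.

  # Each entry is a (gender, variety) pair for one author of a given language.
  def score_language(gold, pred):
      n = len(gold)
      gender_acc = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
      variety_acc = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
      joint_acc = sum(g == p for g, p in zip(gold, pred)) / n  # both correct
      return gender_acc, variety_acc, joint_acc

  # Hypothetical gold and predicted labels for three authors:
  gold = [('female', 'canada'), ('male', 'ireland'), ('male', 'united states')]
  pred = [('female', 'canada'), ('male', 'united states'), ('male', 'united states')]
  print(score_language(gold, pred))  # (1.0, 0.666..., 0.666...)
  # The final ranking averages these per-language values across languages.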

Results

The following tables list the performances achieved by the participating teams in the different subtasks:

We provide three baselines:

  • LDR-baseline: Described in A Low Dimensionality Representation for Language Variety Identification. In: Postproc. 17th Int. Conf. on Computational Linguistics and Intelligent Text Processing (CICLing 2016), Springer, LNCS, arXiv:1705.10754
  • BOW-baseline: A common bag-of-words model with the 1,000 most frequent words (a rough sketch follows this list)
  • STAT-baseline: A statistical baseline (majority class or random choice)
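
As a rough illustration of the BOW-baseline idea, the following Python sketch builds a bag-of-words model over the 1,000 most frequent tokens and trains a linear classifier on top; the vectorizer settings and classifier choice are our assumptions, not the organisers' implementation (requires scikit-learn).

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # docs: one concatenated tweet feed per author; labels: gender (or variety)
  docs = ["example feed of author one ...", "example feed of author two ..."]
  labels = ["male", "female"]

  baseline = make_pipeline(
      CountVectorizer(max_features=1000),  # keep the 1,000 most frequent terms
      LogisticRegression(max_iter=1000),
  )
  baseline.fit(docs, labels)
  print(baseline.predict(["another unseen feed ..."]))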

GLOBAL RANKING
RANK TEAM GENDER VARIETY JOINT AVERAGE
1 Basile et al. 0.8253 0.9184 0.8361 0.8599
2 Martinc et al. 0.8224 0.9085 0.8285 0.8531
3 Tellez et al. 0.8097 0.9171 0.8258 0.8509
4 Miura et al. 0.8127 0.8982 0.8162 0.8424
5 López-Monroy et al. 0.8047 0.8986 0.8111 0.8381
6 Markov et al. 0.7957 0.9056 0.8097 0.8370
7 Poulston et al. 0.7974 0.8786 0.7942 0.8234
8 Sierra et al. 0.7641 0.8911 0.7822 0.8125
LDR-baseline 0.7325 0.9187 0.7750 0.8087
9 Ogaltsov & Romanov 0.7669 0.8591 0.7653 0.7971
10 Franco-Salvador et al. 0.7667 0.8508 0.7582 0.7919
11 Schaetti 0.7207 0.8864 0.7511 0.7861
12 Kodiyan et al. 0.7531 0.8522 0.7509 0.7854
13 Ciobanu et al. 0.7504 0.8524 0.7498 0.7842
14 Kheng et al. 0.7002 0.8513 0.7176 0.7564
15 Ganesh* 0.7342 0.7626 0.6881 0.7283
16 Kocher & Savoy 0.7178 0.7661 0.6813 0.7217
17 Ignatov et al. 0.6917 0.7024 0.6270 0.6737
BOW-baseline 0.6763 0.6907 0.6195 0.6622
18 Khan 0.6252 0.5296 0.4952 0.5500
19 Ribeiro-Oliveira et al. 0.3666 0.4141 0.3092 0.3633
STAT-baseline 0.5000 0.2649 0.2991 0.3547
20 Alrifai et al. 0.1806 0.1888 0.1701 0.1798
21 Bouzazi* 0.1530 0.0931 0.1027 0.1163
22 Adame et al. 0.1353 0.0476 0.0695 0.0841

VARIETY RANKING
RANK TEAM ARABIC ENGLISH PORTUGUESE SPANISH AVERAGE
LDR-baseline 0.8250 0.8996 0.9875 0.9625 0.9187
1 Basile et al. 0.8313 0.8988 0.9813 0.9621 0.9184
2 Tellez et al. 0.8275 0.9004 0.9850 0.9554 0.9171
3 Martinc et al. 0.8288 0.8688 0.9838 0.9525 0.9085
4 Markov et al. 0.8169 0.8767 0.9850 0.9439 0.9056
5 López-Monroy et al. 0.8119 0.8567 0.9825 0.9432 0.8986
6 Miura et al. 0.8125 0.8717 0.9813 0.9271 0.8982
7 Sierra et al. 0.7950 0.8392 0.9850 0.9450 0.8911
8 Schaetti 0.8131 0.8150 0.9838 0.9336 0.8864
9 Poulston et al. 0.7975 0.8038 0.9763 0.9368 0.8786
10 Ogaltsov & Romanov 0.7556 0.8092 0.9725 0.8989 0.8591
11 Ciobanu et al. 0.7569 0.7746 0.9788 0.8993 0.8524
12 Kodiyan et al. 0.7688 0.7908 0.9350 0.9143 0.8522
13 Kheng et al. 0.7544 0.7588 0.9750 0.9168 0.8513
14 Franco-Salvador et al. 0.7656 0.7588 0.9788 0.9000 0.8508
15 Kocher & Savoy 0.7188 0.6521 0.9725 0.7211 0.7661
16 Ganesh* 0.7144 0.6021 0.9650 0.7689 0.7626
17 Ignatov et al. 0.4488 0.5813 0.9763 0.8032 0.7024
BOW-baseline 0.3394 0.6592 0.9712 0.7929 0.6907
18 Khan 0.5844 0.2779 0.9063 0.3496 0.5296
19 Ribeiro-Oliveira et al. 0.6713 0.9850 0.4141
STAT-baseline 0.2500 0.1667 0.5000 0.1429 0.2649
20 Alrifai et al. 0.7550 0.1888
21 Bouzazi* 0.3725 0.0931
22 Adame et al. 0.1904 0.0476

GENDER RANKING
RANK TEAM ARABIC ENGLISH PORTUGUESE SPANISH AVERAGE
1 Basile et al. 0.8006 0.8233 0.8450 0.8321 0.8253
2 Martinc et al. 0.8031 0.8071 0.8600 0.8193 0.8224
3 Miura et al. 0.7644 0.8046 0.8700 0.8118 0.8127
4 Tellez et al. 0.7838 0.8054 0.8538 0.7957 0.8097
5 López-Monroy et al. 0.7763 0.8171 0.8238 0.8014 0.8047
6 Poulston et al. 0.7738 0.7829 0.8388 0.7939 0.7974
7 Markov et al. 0.7719 0.8133 0.7863 0.8114 0.7957
8 Ogaltsov & Romanov 0.7213 0.7875 0.7988 0.7600 0.7669
9 Franco-Salvador et al. 0.7300 0.7958 0.7688 0.7721 0.7667
10 Sierra et al. 0.6819 0.7821 0.8225 0.7700 0.7641
11 Kodiyan et al. 0.7150 0.7888 0.7813 0.7271 0.7531
12 Ciobanu et al. 0.7131 0.7642 0.7713 0.7529 0.7504
13 Ganesh* 0.6794 0.7829 0.7538 0.7207 0.7342
LDR-baseline 0.7044 0.7220 0.7863 0.7171 0.7325
14 Schaetti 0.6769 0.7483 0.7425 0.7150 0.7207
15 Kocher & Savoy 0.6913 0.7163 0.7788 0.6846 0.7178
16 Kheng et al. 0.6856 0.7546 0.6638 0.6968 0.7002
17 Ignatov et al. 0.6425 0.7446 0.6850 0.6946 0.6917
BOW-baseline 0.5300 0.7075 0.7812 0.6864 0.6763
18 Khan 0.5863 0.6692 0.6100 0.6354 0.6252
STAT-baseline 0.5000 0.5000 0.5000 0.5000 0.5000
19 Ribeiro-Oliveira et al. 0.7013 0.7650 0.3666
20 Alrifai et al. 0.7225 0.1806
21 Bouzazi* 0.6121 0.1530
22 Adame et al. 0.5413 0.1353

JOINT RANKING
RANK TEAM ARABIC ENGLISH PORTUGUESE SPANISH AVERAGE
1 Basile et al. 0.6831 0.7429 0.8288 0.8036 0.7646
2 Martinc et al. 0.6825 0.7042 0.8463 0.7850 0.7545
3 Tellez et al. 0.6713 0.7267 0.8425 0.7621 0.7507
4 Miura et al. 0.6419 0.6992 0.8575 0.7518 0.7376
5 López-Monroy et al. 0.6475 0.7029 0.8100 0.7604 0.7302
6 Markov et al. 0.6525 0.7125 0.7750 0.7704 0.7276
7 Poulston et al. 0.6356 0.6254 0.8188 0.7471 0.7067
8 Sierra et al. 0.5694 0.6567 0.8113 0.7279 0.6913
LDR-baseline 0.5888 0.6357 0.7763 0.6943 0.6738
9 Ogaltsov & Romanov 0.5731 0.6450 0.7775 0.6846 0.6701
10 Franco-Salvador et al. 0.5688 0.6046 0.7525 0.7021 0.6570
11 Kodiyan et al. 0.5688 0.6263 0.7300 0.6646 0.6474
12 Ciobanu et al. 0.5619 0.5904 0.7575 0.6764 0.6466
13 Schaetti 0.5681 0.6150 0.7300 0.6718 0.6462
14 Kheng et al. 0.5475 0.5704 0.6475 0.6400 0.6014
15 Ganesh* 0.5075 0.4713 0.7300 0.5614 0.5676
16 Kocher & Savoy 0.5206 0.4650 0.7575 0.4971 0.5601
BOW-baseline 0.1794 0.4713 0.7588 0.5561 0.4914
17 Ignatov et al. 0.2875 0.4333 0.6675 0.5593 0.4869
18 Khan 0.3650 0.1900 0.5488 0.2189 0.3307
19 Ribeiro-Oliveira et al. 0.4831 0.7538 0.3092
20 Alrifai et al. 0.5638 0.1410
STAT-baseline 0.1250 0.0833 0.2500 0.0714 0.1324
21 Bouzazi* 0.2479 0.0620
22 Adame et al. 0.1017 0.0254

Task Committee