Author Profiling
2017

Sponsor: MeaningCloud

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by a group of people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and to identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus of this task is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Award

We are happy to announce that the best performing team at the 5th International Competition on Author Profiling will be awarded 300,- Euro, sponsored by MeaningCloud.

Task

Gender and language variety identification in Twitter. Demographic traits such as gender and language variety have so far been investigated separately. In this task, we provide participants with a Twitter corpus annotated with the authors' gender and the specific variety of their native language:

  • English (Australia, Canada, Great Britain, Ireland, New Zealand, United States)
  • Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)
  • Portuguese (Brazil, Portugal)
  • Arabic (Egypt, Gulf, Levantine, Maghrebi)    

Although we suggest participating in both subtasks (gender and language variety identification) and in all languages, it is possible to participate in only one of them and in only some of the languages.

Training corpus

To develop your software, we provide you with a training data set consisting of tweets in English, Spanish, Portuguese, and Arabic, labeled with gender and language variety.

Download corpus (Updated March 10, 2017)

Info about additional training material (although domains are different): http://ttg.uni-saarland.de/resources/DSLCC
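
As a starting point, the following minimal Python sketch reads one language folder of the training corpus. It assumes the usual PAN layout of one XML file per author plus a truth.txt file with ":::"-separated labels; since the exact format is not described on this page, adjust the sketch to the structure of the downloaded corpus.

  import os
  import xml.etree.ElementTree as ET

  def load_language_folder(folder):
      """Read authors' tweets and gold labels from one language folder.

      Assumed layout (check the downloaded corpus): one XML file per author
      containing <document> elements, plus a truth.txt with lines of the
      form 'author-id:::gender:::variety'.
      """
      truth = {}
      with open(os.path.join(folder, "truth.txt"), encoding="utf-8") as fh:
          for line in fh:
              author_id, gender, variety = line.strip().split(":::")
              truth[author_id] = (gender, variety)

      authors = {}
      for name in os.listdir(folder):
          if not name.endswith(".xml"):
              continue
          author_id = os.path.splitext(name)[0]
          tree = ET.parse(os.path.join(folder, name))
          authors[author_id] = [doc.text or "" for doc in tree.iter("document")]
      return authors, truth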

Output

Your software must take as input the absolute path to an unpacked dataset and must output, for each document of the dataset, a corresponding XML file that looks like this:

  <author id="author-id"
	  lang="en|es|pt|ar"
	  variety="australia|canada|great britain|ireland|new zealand|united states|
	  	argentina|chile|colombia|mexico|peru|spain|venezuela|
		portugal|brazil|
		gulf|levantine|maghrebi|egypt"
	  gender="male|female"
  />
  

The naming of the output files is up to you; we recommend using the author id as the filename and "xml" as the extension.

IMPORTANT! Languages must not be mixed. Create a folder for each language and place inside it only the files with predictions for that language.
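
For illustration, here is a minimal sketch of writing one prediction file per author into a sub-folder per language; the write_prediction helper and the example call are only illustrative, not part of any provided tooling.

  import os
  from xml.sax.saxutils import quoteattr

  def write_prediction(output_dir, lang, author_id, variety, gender):
      """Write one <author .../> prediction file into the sub-folder of its language."""
      lang_dir = os.path.join(output_dir, lang)
      os.makedirs(lang_dir, exist_ok=True)   # one folder per language, never mixed
      xml = (
          f"<author id={quoteattr(author_id)}\n"
          f"        lang={quoteattr(lang)}\n"
          f"        variety={quoteattr(variety)}\n"
          f"        gender={quoteattr(gender)}\n"
          f"/>\n"
      )
      with open(os.path.join(lang_dir, author_id + ".xml"), "w", encoding="utf-8") as fh:
          fh.write(xml)

  # Example: write_prediction("/path/to/output", "en", "some-author-id", "ireland", "female")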

Performance Measures

The performance of your author profiling solution will be ranked by accuracy.

For each language, we will calculate individual accuracies for gender and for variety identification. Then, we will calculate the joint accuracy, i.e., the proportion of authors for whom BOTH variety and gender are predicted correctly. Finally, we will average the accuracy values per language to obtain the final ranking.
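
For concreteness, here is a minimal sketch of how these accuracies can be computed for one language, assuming the gold and predicted labels are available as dictionaries mapping author ids to (gender, variety) pairs; the names below are illustrative and not part of the official evaluation code.

  def accuracies(gold, pred):
      """Gender, variety, and joint accuracy for one language.

      gold and pred map author-id -> (gender, variety).
      """
      n = len(gold)
      gender_acc = sum(pred[a][0] == g for a, (g, v) in gold.items()) / n
      variety_acc = sum(pred[a][1] == v for a, (g, v) in gold.items()) / n
      joint_acc = sum(pred[a] == gv for a, gv in gold.items()) / n
      return gender_acc, variety_acc, joint_acc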

Submission

We ask you to prepare your software so that it can be executed via command line calls. More details will be released here soon.

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with third parties or to use it for any purpose other than the PAN competition.

Results

The following tables list the performances achieved by the participating teams in the different subtasks:

We provide three baselines:

  • LDR-baseline: It is described in A Low Dimensionality Representation for Language Variety Identification. In: Postproc. 17th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2016, Springer-Verlag, LNCS, arXiv:1705.10754
  • BOW-baseline: A standard bag-of-words model built from the 1000 most frequent words (a minimal sketch follows this list)
  • STAT-baseline: A statistical baseline (majority class or random choice)
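
To give an idea of what the BOW-baseline amounts to, here is a minimal scikit-learn sketch; the choice of logistic regression as classifier is an assumption for illustration, not necessarily what the official baseline used.

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  def bow_baseline():
      """Bag-of-words over the 1000 most frequent words with a simple linear classifier."""
      return make_pipeline(
          CountVectorizer(max_features=1000),   # keep only the 1000 most frequent words
          LogisticRegression(max_iter=1000),
      )

  # Usage: concatenate each author's tweets into one document, then
  #   model = bow_baseline(); model.fit(train_texts, train_labels)
  #   predictions = model.predict(test_texts)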

GLOBAL RANKING
RANK  TEAM                      GENDER   VARIETY  JOINT    AVERAGE
1     Basile et al.             0.8253   0.9184   0.8361   0.8599
2     Martinc et al.            0.8224   0.9085   0.8285   0.8531
3     Tellez et al.             0.8097   0.9171   0.8258   0.8509
4     Miura et al.              0.8127   0.8982   0.8162   0.8424
5     López-Monroy et al.       0.8047   0.8986   0.8111   0.8381
6     Markov et al.             0.7957   0.9056   0.8097   0.8370
7     Poulston et al.           0.7974   0.8786   0.7942   0.8234
8     Sierra et al.             0.7641   0.8911   0.7822   0.8125
      LDR-baseline              0.7325   0.9187   0.7750   0.8087
9     Ogaltsov & Romanov        0.7669   0.8591   0.7653   0.7971
10    Franco-Salvador et al.    0.7667   0.8508   0.7582   0.7919
11    Schaetti                  0.7207   0.8864   0.7511   0.7861
12    Kodiyan et al.            0.7531   0.8522   0.7509   0.7854
13    Ciobanu et al.            0.7504   0.8524   0.7498   0.7842
14    Kheng et al.              0.7002   0.8513   0.7176   0.7564
15    Ganesh*                   0.7342   0.7626   0.6881   0.7283
16    Kocher & Savoy            0.7178   0.7661   0.6813   0.7217
17    Ignatov et al.            0.6917   0.7024   0.6270   0.6737
      BOW-baseline              0.6763   0.6907   0.6195   0.6622
18    Khan                      0.6252   0.5296   0.4952   0.5500
19    Ribeiro-Oliveira et al.   0.3666   0.4141   0.3092   0.3633
      STAT-baseline             0.5000   0.2649   0.2991   0.3547
20    Alrifai et al.            0.1806   0.1888   0.1701   0.1798
21    Bouzazi*                  0.1530   0.0931   0.1027   0.1163
22    Adame et al.              0.1353   0.0476   0.0695   0.0841

VARIETY RANKING
RANK  TEAM                      ARABIC   ENGLISH  PORTUGUESE   SPANISH  AVERAGE
      LDR-baseline              0.8250   0.8996   0.9875       0.9625   0.9187
1     Basile et al.             0.8313   0.8988   0.9813       0.9621   0.9184
2     Tellez et al.             0.8275   0.9004   0.9850       0.9554   0.9171
3     Martinc et al.            0.8288   0.8688   0.9838       0.9525   0.9085
4     Markov et al.             0.8169   0.8767   0.9850       0.9439   0.9056
5     López-Monroy et al.       0.8119   0.8567   0.9825       0.9432   0.8986
6     Miura et al.              0.8125   0.8717   0.9813       0.9271   0.8982
7     Sierra et al.             0.7950   0.8392   0.9850       0.9450   0.8911
8     Schaetti                  0.8131   0.8150   0.9838       0.9336   0.8864
9     Poulston et al.           0.7975   0.8038   0.9763       0.9368   0.8786
10    Ogaltsov & Romanov        0.7556   0.8092   0.9725       0.8989   0.8591
11    Ciobanu et al.            0.7569   0.7746   0.9788       0.8993   0.8524
12    Kodiyan et al.            0.7688   0.7908   0.9350       0.9143   0.8522
13    Kheng et al.              0.7544   0.7588   0.9750       0.9168   0.8513
14    Franco-Salvador et al.    0.7656   0.7588   0.9788       0.9000   0.8508
15    Kocher & Savoy            0.7188   0.6521   0.9725       0.7211   0.7661
16    Ganesh*                   0.7144   0.6021   0.9650       0.7689   0.7626
17    Ignatov et al.            0.4488   0.5813   0.9763       0.8032   0.7024
      BOW-baseline              0.3394   0.6592   0.9712       0.7929   0.6907
18    Khan                      0.5844   0.2779   0.9063       0.3496   0.5296
19    Ribeiro-Oliveira et al.   -        0.6713   0.9850       -        0.4141
      STAT-baseline             0.2500   0.1667   0.5000       0.1429   0.2649
20    Alrifai et al.            0.7550   -        -            -        0.1888
21    Bouzazi*                  -        0.3725   -            -        0.0931
22    Adame et al.              -        -        -            0.1904   0.0476

GENDER RANKING
RANK  TEAM                      ARABIC   ENGLISH  PORTUGUESE   SPANISH  AVERAGE
1     Basile et al.             0.8006   0.8233   0.8450       0.8321   0.8253
2     Martinc et al.            0.8031   0.8071   0.8600       0.8193   0.8224
3     Miura et al.              0.7644   0.8046   0.8700       0.8118   0.8127
4     Tellez et al.             0.7838   0.8054   0.8538       0.7957   0.8097
5     López-Monroy et al.       0.7763   0.8171   0.8238       0.8014   0.8047
6     Poulston et al.           0.7738   0.7829   0.8388       0.7939   0.7974
7     Markov et al.             0.7719   0.8133   0.7863       0.8114   0.7957
8     Ogaltsov & Romanov        0.7213   0.7875   0.7988       0.7600   0.7669
9     Franco-Salvador et al.    0.7300   0.7958   0.7688       0.7721   0.7667
10    Sierra et al.             0.6819   0.7821   0.8225       0.7700   0.7641
11    Kodiyan et al.            0.7150   0.7888   0.7813       0.7271   0.7531
12    Ciobanu et al.            0.7131   0.7642   0.7713       0.7529   0.7504
13    Ganesh*                   0.6794   0.7829   0.7538       0.7207   0.7342
      LDR-baseline              0.7044   0.7220   0.7863       0.7171   0.7325
14    Schaetti                  0.6769   0.7483   0.7425       0.7150   0.7207
15    Kocher & Savoy            0.6913   0.7163   0.7788       0.6846   0.7178
16    Kheng et al.              0.6856   0.7546   0.6638       0.6968   0.7002
17    Ignatov et al.            0.6425   0.7446   0.6850       0.6946   0.6917
      BOW-baseline              0.5300   0.7075   0.7812       0.6864   0.6763
18    Khan                      0.5863   0.6692   0.6100       0.6354   0.6252
      STAT-baseline             0.5000   0.5000   0.5000       0.5000   0.5000
19    Ribeiro-Oliveira et al.   -        0.7013   0.7650       -        0.3666
20    Alrifai et al.            0.7225   -        -            -        0.1806
21    Bouzazi*                  -        0.6121   -            -        0.1530
22    Adame et al.              -        -        -            0.5413   0.1353

JOINT RANKING
RANK  TEAM                      ARABIC   ENGLISH  PORTUGUESE   SPANISH  AVERAGE
1     Basile et al.             0.6831   0.7429   0.8288       0.8036   0.7646
2     Martinc et al.            0.6825   0.7042   0.8463       0.7850   0.7545
3     Tellez et al.             0.6713   0.7267   0.8425       0.7621   0.7507
4     Miura et al.              0.6419   0.6992   0.8575       0.7518   0.7376
5     López-Monroy et al.       0.6475   0.7029   0.8100       0.7604   0.7302
6     Markov et al.             0.6525   0.7125   0.7750       0.7704   0.7276
7     Poulston et al.           0.6356   0.6254   0.8188       0.7471   0.7067
8     Sierra et al.             0.5694   0.6567   0.8113       0.7279   0.6913
      LDR-baseline              0.5888   0.6357   0.7763       0.6943   0.6738
9     Ogaltsov & Romanov        0.5731   0.6450   0.7775       0.6846   0.6701
10    Franco-Salvador et al.    0.5688   0.6046   0.7525       0.7021   0.6570
11    Kodiyan et al.            0.5688   0.6263   0.7300       0.6646   0.6474
12    Ciobanu et al.            0.5619   0.5904   0.7575       0.6764   0.6466
13    Schaetti                  0.5681   0.6150   0.7300       0.6718   0.6462
14    Kheng et al.              0.5475   0.5704   0.6475       0.6400   0.6014
15    Ganesh*                   0.5075   0.4713   0.7300       0.5614   0.5676
16    Kocher & Savoy            0.5206   0.4650   0.7575       0.4971   0.5601
      BOW-baseline              0.1794   0.4713   0.7588       0.5561   0.4914
17    Ignatov et al.            0.2875   0.4333   0.6675       0.5593   0.4869
18    Khan                      0.3650   0.1900   0.5488       0.2189   0.3307
19    Ribeiro-Oliveira et al.   -        0.4831   0.7538       -        0.3092
20    Alrifai et al.            0.5638   -        -            -        0.1410
      STAT-baseline             0.1250   0.0833   0.2500       0.0714   0.1324
21    Bouzazi*                  -        0.2479   -            -        0.0620
22    Adame et al.              -        -        -            0.1017   0.0254
Related Work and Corpora

We refer you to the overview papers and corpora of the previous editions of the author profiling task at PAN.

Task Chair

Paolo Rosso
Universitat Politècnica de València

Task Committee

Francisco Rangel
Autoritas Consulting

Benno Stein
Bauhaus-Universität Weimar

Martin Potthast
Bauhaus-Universität Weimar

Walter Daelemans
University of Antwerp

Efstathios Stamatatos
University of the Aegean

© pan.webis.de