Author Profiling
2018

Sponsor: MeaningCloud

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Whereas the author identification and author verification tasks examine the style of individual authors, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This makes it possible to infer profile aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance for applications in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify its characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of the people who like or dislike their products. We focus on author profiling in social media since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Award

We are happy to announce that the best performing team at the 6th International Competition on Author Profiling will be awarded 300 Euro, sponsored by MeaningCloud.

  • Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki Taniguchi, and Tomoko Ohkuma. Fuji Xerox Co., Ltd.

Congratulations!

Task

This year the focus will be on gender identification on Twitter, where both text and images may be used as information sources. The languages addressed will be:

  • English
  • Spanish
  • Arabic

Although we encourage you to participate in all three languages, it is possible to participate in only some of them. Similarly, although we encourage you to tackle both textual and image classification, it is possible to address only one of them.

Training corpus

To develop your software, we provide you with a training data set that consists of Twitter users labeled with gender. For each author, a total of 100 tweets and 10 images are provided. Authors are grouped by the language of their tweets: English, Arabic, and Spanish.

Download corpus (Updated February 27, 2018)
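
To get started with the data, the sketch below shows one way such a corpus can be loaded. The directory layout, the truth-file name, and the ":::" separator are assumptions for illustration only; consult the downloaded corpus for the authoritative format.

  # A minimal loading sketch. The directory layout and the ':::' truth-file
  # separator below are assumptions; check the downloaded corpus for the
  # authoritative format.
  import os
  import xml.etree.ElementTree as ET

  def load_truth(lang_dir, lang):
      """Read a hypothetical '<lang>.txt' file mapping author-id to gender."""
      labels = {}
      with open(os.path.join(lang_dir, lang + ".txt"), encoding="utf-8") as f:
          for line in f:
              author_id, gender = line.strip().split(":::")[:2]
              labels[author_id] = gender
      return labels

  def load_tweets(lang_dir, author_id):
      """Parse an author's XML file; assumes one <document> element per tweet."""
      tree = ET.parse(os.path.join(lang_dir, "text", author_id + ".xml"))
      return [doc.text or "" for doc in tree.iter("document")]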

Test corpus

Download test corpus + truth files (Updated March 20, 2018)

Output

Your software must take as input the absolute path to an unpacked dataset and must output, for each author in the dataset, a corresponding XML file that looks like this:

  <author id="author-id"
	  lang="en|es|ar"
	  gender_txt="male|female"
	  gender_img="male|female"
	  gender_comb="male|female"
  />
  

We ask you to provide up to three different predictions for each author's gender, depending on your approach:

  • gender_txt: gender prediction by using only text
  • gender_img: gender prediction by using only images
  • gender_comb: gender prediction by using both text and images

As mentioned above, you can participate in both textual and image classification, or in only one of them. If your approach uses only textual features, your prediction should be given in gender_txt. Similarly, if your approach relies only on images, your prediction should be given in gender_img. If you use both text and images, your prediction should be given in gender_comb. In that case, if you can also provide the predictions obtained with each approach separately, this will allow us to perform a more in-depth analysis of the results and to compare textual versus image-based author profiling; you would then provide all three predictions for the same author: gender_txt, gender_img, and gender_comb.

The naming of the output files is up to you; we recommend using the author-id as the filename and "xml" as the extension.

IMPORTANT! Languages must not be mixed: create a folder for each language and place inside it only the files with the predictions for that language.
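
As an illustration of this output convention, here is a minimal sketch that writes one prediction file per author into the corresponding per-language folder. The helper name write_prediction and its keyword arguments are our own, not part of any official tooling.

  # A minimal sketch of writing one prediction file per author into a
  # per-language folder, matching the XML schema shown above.
  import os
  from xml.sax.saxutils import quoteattr

  def write_prediction(out_dir, lang, author_id,
                       gender_txt=None, gender_img=None, gender_comb=None):
      lang_dir = os.path.join(out_dir, lang)   # one folder per language
      os.makedirs(lang_dir, exist_ok=True)
      attrs = [("id", author_id), ("lang", lang)]
      # Emit only the predictions your approach actually produces.
      for name, value in (("gender_txt", gender_txt),
                          ("gender_img", gender_img),
                          ("gender_comb", gender_comb)):
          if value is not None:
              attrs.append((name, value))
      body = "\n\t".join("%s=%s" % (k, quoteattr(v)) for k, v in attrs)
      with open(os.path.join(lang_dir, author_id + ".xml"), "w",
                encoding="utf-8") as f:
          f.write("<author %s\n/>" % body)

For example, write_prediction("out", "en", "author-id", gender_txt="female", gender_comb="female") creates out/en/author-id.xml containing only the attributes that were supplied.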

Performance Measures

The performance of your author profiling solution will be ranked by accuracy.

For each language, we will calculate an individual accuracy. Then, we will average the accuracy values across the three languages to obtain the final ranking.
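
As a sketch of this computation, assuming truth and predictions are given as dictionaries mapping author ids to genders:

  # Per-language accuracy and the unweighted average across languages.
  def accuracy(truth, predictions):
      """truth, predictions: dicts mapping author-id to 'male'/'female'."""
      correct = sum(1 for author, gender in truth.items()
                    if predictions.get(author) == gender)
      return correct / len(truth)

  def final_score(per_language_accuracies):
      """per_language_accuracies: e.g. {'ar': 0.81, 'en': 0.80, 'es': 0.79}."""
      return sum(per_language_accuracies.values()) / len(per_language_accuracies)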

Submission

We ask you to prepare your software so that it can be executed via command line calls. More details will be released here soon.
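
Since the exact interface has not been announced yet, the skeleton below only illustrates the expected shape of such a call (dataset path in, output directory out); the -c and -o flags are hypothetical.

  # Hypothetical command-line skeleton; the official flags may differ.
  import argparse

  def main():
      parser = argparse.ArgumentParser(description="PAN'18 author profiling")
      parser.add_argument("-c", dest="input_dataset", required=True,
                          help="absolute path to the unpacked dataset")
      parser.add_argument("-o", dest="output_dir", required=True,
                          help="directory to receive the prediction XML files")
      args = parser.parse_args()
      # Replace the prints with your actual prediction pipeline.
      print("dataset:", args.input_dataset)
      print("output:", args.output_dir)

  if __name__ == "__main__":
      main()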

You can choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with third parties or use it for purposes other than the PAN competition.

Results

The following tables list the performances achieved by the participating teams in the different subtasks:

GLOBAL RANKING

Rank  Team                           Arabic  English  Spanish  Average
   1  takahashi18                    0.7850  0.8584   0.8159   0.8198
   2  daneshvar18                    0.8090  0.8221   0.8200   0.8170
   3  miranda18                      0.8180  0.8068   0.7955   0.8068
   4  laporte18                      0.7940  0.8132   0.8000   0.8024
   5  schuur18                       0.7920  0.8074   0.7918   0.7971
   6  vaneerden18 --> nieuwenhuis18  0.7870  0.8095   0.7923   0.7963
   7  sierraloaiza18                 0.8100  0.8063   0.7477   0.7880
   8  martinc18                      0.7780  0.7926   0.7786   0.7831
   9  snijders18                     0.7490  0.7926   0.8036   0.7817
  10  lopezsantillan18               0.7760  0.7847   0.7677   0.7761
  11  miller18                       0.7570  0.7947   0.7623   0.7713
  12  gouravdas18                    0.7680  0.7737   0.7709   0.7709
  13  yigal18                        0.7570  0.7889   0.7591   0.7683
  14  pool18                         0.7640  0.7884   0.7432   0.7652
  15  vondaeniken18                  0.7320  0.7742   0.7464   0.7509
  16  schaetti18                     0.7390  0.7711   0.7359   0.7487
  17  aragonsaenzpardo18             0.6670  0.8016   0.7723   0.7470
  18  bayot18                        0.6760  0.7716   0.6873   0.7116
  19  gariboiorts18                  0.6750  0.7363   0.7164   0.7092
  20  tekir18                        0.6920  0.7495   0.6655   0.7023
  21  raiyani18                      0.7220  0.7279   0.6436   0.6978
  22  sandroni18                     0.6870  0.6658   0.6782   0.6770
  23  karlgren18                     -       0.5521   -        -

ARABIC RANKING

Rank  Team                           Text    Images  Combined
   1  miranda18                      0.8170  0.5900  0.8180
   2  sierraloaiza18                 0.8120  0.7280  0.8100
   3  daneshvar18                    0.8090  -       0.8090
   4  laporte18                      0.7910  0.7010  0.7940
   5  schuur18                       0.7920  -       0.7920
   6  vaneerden18 --> nieuwenhuis18  0.7830  0.6230  0.7870
   7  takahashi18                    0.7710  0.7720  0.7850
   8  martinc18                      0.7760  0.5600  0.7780
   9  lopezsantillan18               0.7760  -       0.7760
  10  gouravdas18                    0.7430  0.6570  0.7680
  11  pool18                         0.7600  0.6230  0.7640
  12  miller18                       0.7590  0.4970  0.7570
  13  yigal18                        0.7590  0.5100  0.7570
  14  snijders18                     0.7490  -       0.7490
  15  schaetti18                     0.7390  0.5430  0.7390
  16  vondaeniken18                  0.7320  -       0.7320
  17  raiyani18                      0.7220  -       0.7220
  18  tekir18                        0.6920  -       0.6920
  19  sandroni18                     0.6870  -       0.6870
  20  bayot18                        0.6760  -       0.6760
  21  gariboiorts18                  0.6750  -       0.6750
  22  aragonsaenzpardo18             0.6480  0.6800  0.6670
  23  karlgren18                     -       -       -

ENGLISH RANKING

Rank  Team                           Text    Images  Combined
   1  takahashi18                    0.7968  0.8163  0.8584
   2  daneshvar18                    0.8221  -       0.8221
   3  laporte18                      0.8074  0.6963  0.8132
   4  vaneerden18 --> nieuwenhuis18  0.8116  0.6100  0.8095
   5  schuur18                       0.8074  -       0.8074
   6  miranda18                      0.8121  0.5468  0.8068
   7  sierraloaiza18                 0.8011  0.7442  0.8063
   8  aragonsaenzpardo18             0.7963  0.6921  0.8016
   9  miller18                       0.7911  0.5174  0.7947
  10  snijders18                     0.7926  -       0.7926
  11  martinc18                      0.7900  0.5826  0.7926
  12  yigal18                        0.7911  0.4942  0.7889
  13  pool18                         0.7853  0.6584  0.7884
  14  lopezsantillan18               0.7847  -       0.7847
  15  vondaeniken18                  0.7742  -       0.7742
  16  gouravdas18                    0.7558  0.6747  0.7737
  17  bayot18                        0.7716  -       0.7716
  18  schaetti18                     0.7711  0.5763  0.7711
  19  tekir18                        0.7495  -       0.7495
  20  gariboiorts18                  0.7363  -       0.7363
  21  raiyani18                      0.7279  -       0.7279
  22  sandroni18                     0.6658  -       0.6658
  23  karlgren18                     0.5521  -       0.5521

SPANISH RANKING

Rank  Team                           Text    Images  Combined
   1  daneshvar18                    0.8200  -       0.8200
   2  takahashi18                    0.7864  0.7732  0.8159
   3  snijders18                     0.8036  -       0.8036
   4  laporte18                      0.7959  0.6805  0.8000
   5  miranda18                      0.8005  0.5691  0.7955
   6  vaneerden18 --> nieuwenhuis18  0.8027  0.5873  0.7923
   7  schuur18                       0.7918  -       0.7918
   8  martinc18                      0.7782  0.5486  0.7786
   9  aragonsaenzpardo18             0.7686  0.6668  0.7723
  10  gouravdas18                    0.7586  0.6918  0.7709
  11  lopezsantillan18               0.7677  -       0.7677
  12  miller18                       0.7650  0.4923  0.7623
  13  yigal18                        0.7650  0.5027  0.7591
  14  sierraloaiza18                 0.7827  0.7100  0.7477
  15  vondaeniken18                  0.7464  -       0.7464
  16  pool18                         0.7405  0.6232  0.7432
  17  schaetti18                     0.7359  0.5782  0.7359
  18  gariboiorts18                  0.7164  -       0.7164
  19  bayot18                        0.6873  -       0.6873
  20  sandroni18                     0.6782  -       0.6782
  21  tekir18                        0.6655  -       0.6655
  22  raiyani18                      0.6436  -       0.6436
  23  karlgren18                     -       -       -

Task Chair

Paolo Rosso

Universitat Politècnica de València

Task Committee

Francisco Rangel

Autoritas Consulting

Manuel Montes-y-Gómez

INAOE - Puebla

Martin Potthast

Bauhaus-Universität Weimar

Benno Stein

Bauhaus-Universität Weimar

© pan.webis.de