Author Profiling
2018

MeaningCloud
Sponsor

Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks, where the style of individual authors is examined, author profiling distinguishes between classes of authors by studying their sociolect, that is, how language is shared by groups of people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in forensics, security, and marketing. For example, from a forensic linguistics perspective, one would like to be able to determine the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people who like or dislike their products. The focus is on author profiling in social media, since we are mainly interested in everyday language and how it reflects basic social and personality processes.

Task

This year the focus will be on gender identification in Twitter, where text and images may be used as information sources. The languages addressed will be:

  • English
  • Spanish
  • Arabic

Although we recommend participating in all languages, it is possible to participate in only some of them. Similarly, although we recommend participating in both text and image classification, it is possible to participate in only one of them.

Training corpus

To develop your software, we provide you with a training data set that consists of Twitter users labeled with gender. For each author, a total of 100 tweets and 10 images are provided. Authors are grouped by the language of their tweets: English, Arabic and Spanish.

Download corpus (Updated February 27, 2018)

Output

Your software must take as input the absolute path to an unpacked dataset, and has to output for each document of the dataset a corresponding XML file that looks like this:

  <author id="author-id"
          lang="en|es|ar"
          gender_txt="male|female"
          gender_img="male|female"
          gender_comb="male|female"
  />

We ask you to provide three different predictions for the author's gender, depending on your approach:

  • gender_txt: gender prediction by using only text
  • gender_img: gender prediction by using only images
  • gender_comb: gender prediction by using both text and images

As noted above, you can participate in both text and image classification, or in only one of them. Hence, if your approach uses only textual features, your prediction should be given in gender_txt. Similarly, if your approach relies only on images, your prediction should be given in gender_img. If you use both text and images, your prediction should be given in gender_comb. Furthermore, in that case, if you can also provide the predictions of the two approaches separately, this would allow us to perform a more in-depth analysis of the results and to compare text-based vs. image-based author profiling. You should then provide all three predictions for the same author: gender_txt, gender_img and gender_comb.

The naming of the output files is up to you. We recommend using the author-id as the filename and "xml" as the extension.

IMPORTANT! Languages should not be mixed. Create one folder per language and place in it only the files with the predictions for that language.
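
The output rules above can be sketched in Python as follows. This is a minimal illustration, not an official tool: the function name write_prediction, the out_root argument, and the use of the language code as the folder name are assumptions made for this example, and it assumes that all three predictions are available.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_prediction(out_root, author_id, lang, gender_txt, gender_img, gender_comb):
    """Write one <author .../> file into the folder for its language.

    Hypothetical helper: the folder name (the language code) is an
    assumption; the task only requires one folder per language.
    """
    if lang not in {"en", "es", "ar"}:
        raise ValueError(f"unexpected language code: {lang}")

    # One folder per language, so predictions are never mixed.
    lang_dir = Path(out_root) / lang
    lang_dir.mkdir(parents=True, exist_ok=True)

    author = ET.Element("author", {
        "id": author_id,
        "lang": lang,
        "gender_txt": gender_txt,
        "gender_img": gender_img,
        "gender_comb": gender_comb,
    })

    # Recommended naming: author-id as filename, "xml" as extension.
    out_path = lang_dir / f"{author_id}.xml"
    ET.ElementTree(author).write(str(out_path), encoding="utf-8")
    return out_path
```

For instance, write_prediction("/tmp/out", "abc123", "es", "female", "male", "female") would create /tmp/out/es/abc123.xml containing a single author element.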

Performance Measures

The performance of your author profiling solution will be ranked by accuracy.

For each language, we will calculate individual accuracies. Then, we will average the accuracy values per language to obtain the final ranking.
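
As a rough illustration of this ranking scheme, the sketch below computes the accuracy per language and then their unweighted average. The function names and the dictionary layout (author id mapped to gender, grouped by language) are assumptions made for this example, and counting a missing prediction as an error is likewise an assumption, not a stated rule.

```python
def accuracy(truth, predicted):
    """Fraction of authors in `truth` whose predicted gender is correct.

    Assumption: an author without a prediction counts as an error.
    """
    if not truth:
        raise ValueError("no ground-truth labels")
    correct = sum(
        1 for author, gender in truth.items() if predicted.get(author) == gender
    )
    return correct / len(truth)


def final_score(truth_by_lang, predicted_by_lang):
    """Return the per-language accuracies and their unweighted mean."""
    per_lang = {
        lang: accuracy(truth, predicted_by_lang.get(lang, {}))
        for lang, truth in truth_by_lang.items()
    }
    return per_lang, sum(per_lang.values()) / len(per_lang)
```

Because the average is taken over languages rather than over authors, each language contributes equally to the final score regardless of how many authors it contains.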

Submission

We ask you to prepare your software so that it can be executed via command line calls. More details will be released here soon.
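
Until those details are available, a command-line entry point might be sketched as below. The argument names input_dir and output_dir are hypothetical and are not the official calling convention; the only requirement taken from this page is that the input is an absolute path to the unpacked dataset.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a hypothetical command line: <input_dir> <output_dir>."""
    parser = argparse.ArgumentParser(
        description="Gender prediction for Twitter authors"
    )
    # Both argument names are assumptions made for this sketch.
    parser.add_argument("input_dir", type=Path,
                        help="absolute path to the unpacked dataset")
    parser.add_argument("output_dir", type=Path,
                        help="folder that receives the per-language XML files")
    args = parser.parse_args(argv)
    if not args.input_dir.is_absolute():
        parser.error("input_dir must be an absolute path")
    return args
```

Calling parse_args() without arguments reads the actual command line, while passing a list such as ["/data/pan18", "/tmp/out"] is convenient for testing.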

You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Related Work and Corpora

We refer you to:

Task Chair

Paolo Rosso

Universitat Politècnica de València

Task Committee

Francisco Rangel

Autoritas Consulting

Manuel Montes-y-Gómez

INAOE - Puebla

Martin Potthast

Bauhaus-Universität Weimar

Benno Stein

Bauhaus-Universität Weimar

© pan.webis.de