
Bots and Gender Profiling

PAN @ CLEF 2019

Given a Twitter feed, determine whether its author is a bot or a human. In case of a human, identify their gender.

Task

Social media bots pose as humans to influence users with commercial, political or ideological purposes. For example, bots can artificially inflate the popularity of a product by promoting it and/or writing positive ratings, as well as undermine the reputation of competing products through negative reviews. The threat is even greater when the purpose is political or ideological (see the Brexit referendum or the US presidential elections). Fearing the effect of this influence, German political parties rejected the use of bots in their campaigns for the general elections. Furthermore, bots are commonly involved in spreading fake news. Approaching the identification of bots from an author profiling perspective is therefore of high importance from the point of view of marketing, forensics and security.

After having addressed several aspects of author profiling in social media from 2013 to 2018 (age and gender, also together with personality, gender and language variety, and gender from a multimodal perspective), this year we aim at investigating whether the author of a Twitter feed is a bot or a human and, in the case of a human, at profiling the author's gender.

 

As in previous years, we propose the task from a multilingual perspective:

  • English
  • Spanish
NOTE: Although we recommend participating in both bots and gender profiling, it is possible to address just one of the problems, and for just one language: English or Spanish.

Unlike previous years, and with the aim of maintaining a realistic scenario, we have not performed any cleaning on the tweets: they remain as the users tweeted them. This means that RTs have not been removed and that tweets in more than one language may appear.


Award

We are happy to announce that the best performing team at the 7th International Competition on Author Profiling will be awarded 300 Euro sponsored by The Logic Value.

Juan Pizarro, Universitat Politècnica de València, Spain.

Congratulations!


Data [Download Training Data]

Input

The uncompressed dataset consists of one folder per language (en, es). Each folder contains:
  • An XML file per author (Twitter user) with 100 tweets. The name of the XML file corresponds to the unique author id.
  • A truth.txt file with the list of authors and the ground truth.
The format of the XML files is:
    <author lang="en">
        <documents>
            <document>Tweet 1 textual contents</document>
            <document>Tweet 2 textual contents</document>
            ...
        </documents>
    </author>
                              
The format of the truth.txt file is as follows. The first column corresponds to the author id. The second and third columns contain the ground truth for the human/bot task and the bot/male/female task, respectively.
	b2d5748083d6fdffec6c2d68d4d4442d:::bot:::bot
	2bed15d46872169dc7deaf8d2b43a56:::bot:::bot
	8234ac5cca1aed3f9029277b2cb851b:::human:::female
	5ccd228e21485568016b4ee82deb0d28:::human:::female
	60d068f9cafb656431e62a6542de2dc0:::human:::male
	c6e5e9c92fb338dc0e029d9ea22a4358:::human:::male
	...
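
For convenience, here is a minimal sketch (in Python, which is only one possible choice) of how the dataset could be loaded. The function name and the example path are illustrative; the file layout and the ":::" separator follow the description above.

    import os
    import xml.etree.ElementTree as ET

    def load_language_folder(folder):
        """Load one language folder (e.g. 'en' or 'es') of the uncompressed dataset.

        Returns a dict: author id -> {'type': ..., 'gender': ..., 'tweets': [...]}.
        """
        authors = {}

        # Ground truth: author-id:::bot|human:::bot|male|female
        with open(os.path.join(folder, "truth.txt"), encoding="utf-8") as f:
            for line in f:
                author_id, author_type, gender = line.strip().split(":::")
                authors[author_id] = {"type": author_type, "gender": gender}

        # One XML file per author, named <author-id>.xml, with the 100 tweets
        # inside <document> elements.
        for author_id, info in authors.items():
            tree = ET.parse(os.path.join(folder, author_id + ".xml"))
            info["tweets"] = [doc.text or "" for doc in tree.iter("document")]

        return authors

    # Example usage (path is illustrative):
    # en_authors = load_language_folder("/path/to/dataset/en")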
                              

Output

Your software must take as input the absolute path to an unpacked dataset and must output, for each author of the dataset, a corresponding XML file that looks like this:

	<author id="author-id"
		lang="en|es"
		type="bot|human"
		gender="bot|male|female"
	/>
                              

The naming of the output files is up to you. However, we recommend using the author id as the filename and "xml" as the extension.

IMPORTANT! Languages should not be mixed. Create a folder for each language and place inside it only the files with the predictions for that language.

IMPORTANT! To avoid overfitting when experimenting with the training set, we recommend using the provided train/dev split (files truth-train.txt and truth-dev.txt).
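
A minimal sketch of how the required output files could be written (Python; the helper name and the example values are illustrative, while the XML attributes and the one-folder-per-language layout follow the requirements above):

    import os
    import xml.etree.ElementTree as ET

    def write_prediction(output_dir, lang, author_id, author_type, gender):
        """Write <output_dir>/<lang>/<author-id>.xml in the required format."""
        lang_dir = os.path.join(output_dir, lang)   # one folder per language
        os.makedirs(lang_dir, exist_ok=True)

        author = ET.Element("author", {
            "id": author_id,
            "lang": lang,            # "en" or "es"
            "type": author_type,     # "bot" or "human"
            "gender": gender,        # "bot", "male" or "female"
        })
        ET.ElementTree(author).write(
            os.path.join(lang_dir, author_id + ".xml"),
            encoding="utf-8", xml_declaration=True)

    # Example call, reusing an author id from the truth.txt excerpt above:
    # write_prediction("/path/to/output", "en", "b2d5748083d6fdffec6c2d68d4d4442d", "bot", "bot")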

Evaluation

The performance of your author profiling solution will be ranked by accuracy. For each language, we will calculate individual accuracies. First, we will calculate the accuracy of identifying bots vs. humans. Then, for the humans, we will calculate the accuracy of identifying males vs. females. Finally, we will average the accuracy values per language to obtain the final ranking.
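
To illustrate the ranking scheme, a small sketch of how the final score could be computed from per-language predictions (the official evaluation script is not reproduced here, so details such as the exact handling of bots in the gender accuracy may differ):

    def accuracy(y_true, y_pred):
        """Fraction of instances predicted correctly."""
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    def final_score(per_language):
        """per_language maps a language code ('en', 'es') to pairs of
        (true, predicted) label lists for the two subtasks, e.g.
        {'en': {'type': (true, pred), 'gender': (true, pred)}, 'es': ...}.

        Returns the individual accuracies and their overall average,
        which corresponds to the AVG column of the results table below."""
        scores = {}
        for lang, labels in per_language.items():
            scores[lang] = {
                "bots": accuracy(*labels["type"]),
                "gender": accuracy(*labels["gender"]),
            }
        values = [v for lang_scores in scores.values() for v in lang_scores.values()]
        return scores, sum(values) / len(values)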

Submission [Submit software to PAN]

This task follows PAN's software submission strategy described here.

Related Work

Results

The following table lists the performance achieved by the participating teams in the different subtasks:

Accuracy for bots vs. human and for gender profiling, per language (EN, ES), together with the overall average (AVG). Rows without a position (POS) are the baselines.

POS  Team                      Bots EN  Bots ES  Gender EN  Gender ES     AVG
  1  Pizarro                    0.9360   0.9333     0.8356     0.8172  0.8805
  2  Srinivasarao & Manu        0.9371   0.9061     0.8398     0.7967  0.8699
  3  Bacciu et al.              0.9432   0.9078     0.8417     0.7761  0.8672
  4  Jimenez-Villar et al.      0.9114   0.9211     0.8212     0.8100  0.8659
  5  Fernquist                  0.9496   0.9061     0.8273     0.7667  0.8624
  6  Mahmood                    0.9121   0.9167     0.8163     0.7950  0.8600
  7  Ipsas & Popescu            0.9345   0.8950     0.8265     0.7822  0.8596
  8  Vogel & Jiang              0.9201   0.9056     0.8167     0.7756  0.8545
  9  Johansson & Isbister       0.9595   0.8817     0.8379     0.7278  0.8517
 10  Goubin et al.              0.9034   0.8678     0.8333     0.7917  0.8491
 11  Polignano & de Pinto       0.9182   0.9156     0.7973     0.7417  0.8432
 12  Valencia et al.            0.9061   0.8606     0.8432     0.7539  0.8410
 13  Kosmajac & Keselj          0.9216   0.8956     0.7928     0.7494  0.8399
 14  Fagni & Tesconi            0.9148   0.9144     0.7670     0.7589  0.8388
     char nGrams                0.9360   0.8972     0.7920     0.7289  0.8385
 15  Glocker                    0.9091   0.8767     0.8114     0.7467  0.8360
     word nGrams                0.9356   0.8833     0.7989     0.7244  0.8356
 16  Martinc et al.             0.8939   0.8744     0.7989     0.7572  0.8311
 17  Sanchis & Velez            0.9129   0.8756     0.8061     0.7233  0.8295
 18  Halvani & Marquardt        0.9159   0.8239     0.8273     0.7378  0.8262
 19  Ashraf et al.              0.9227   0.8839     0.7583     0.7261  0.8228
 20  Gishamer                   0.9352   0.7922     0.8402     0.7122  0.8200
 21  Petrik & Chuda             0.9008   0.8689     0.7758     0.7250  0.8176
 22  Oliveira et al.            0.9057   0.8767     0.7686     0.7150  0.8165
     W2V                        0.9030   0.8444     0.7879     0.7156  0.8127
 23  De La Peña & Prieto        0.9045   0.8578     0.7898     0.6967  0.8122
 24  López Santillán et al.     0.8867   0.8544     0.7773     0.7100  0.8071
     LDSE                       0.9054   0.8372     0.7800     0.6900  0.8032
 25  Bolonyai et al.            0.9136   0.8389     0.7572     0.6956  0.8013
 26  Moryossef                  0.8909   0.8378     0.7871     0.6894  0.8013
 27  Zhechev                    0.8652   0.8706     0.7360     0.7178  0.7974
 28  Giachanou & Ghanem         0.9057   0.8556     0.7731     0.6478  0.7956
 29  Espinosa et al.            0.8413   0.7683     0.8413     0.7178  0.7922
 30  Rahgouy et al.             0.8621   0.8378     0.7636     0.7022  0.7914
 31  Onose et al.               0.8943   0.8483     0.7485     0.6711  0.7906
 32  Przybyla                   0.9155   0.8844     0.6898     0.6533  0.7858
 33  Puertas et al.             0.8807   0.8061     0.7610     0.6944  0.7856
 34  Van Halteren               0.8962   0.8283     0.7420     0.6728  0.7848
 35  Gamallo & Almatarneh       0.8148   0.8767     0.7220     0.7056  0.7798
 36  Bryan & Philipp            0.8689   0.7883     0.6455     0.6056  0.7271
 37  Dias & Paraboni            0.8409   0.8211     0.5807     0.6467  0.7224
 38  Oliva & Masanet            0.9114   0.9111     0.4462     0.4589  0.6819
 39  Hacohen-Kerner et al.      0.4163   0.4744     0.7489     0.7378  0.5944
 40  Kloppenburg                0.5830   0.5389     0.4678     0.4483  0.5095
     MAJORITY                   0.5000   0.5000     0.5000     0.5000  0.5000
     RANDOM                     0.4905   0.4861     0.3716     0.3700  0.4296
 41  Bounaama & Amine           0.5008   0.5050     0.2511     0.2567  0.3784
 42  Joo & Hwang                0.9333        -     0.8360          -  0.4423
 43  Staykovski                 0.9186        -     0.8174          -  0.4340
 44  Cimino & Dell'Orletta      0.9083        -     0.7898          -  0.4245
 45  Ikae et al.                0.9125        -     0.7371          -  0.4124
 46  Jeanneau                   0.8924        -     0.7451          -  0.4094
 47  Zhang                      0.8977        -     0.7197          -  0.4044
 48  Fahim et al.               0.8629        -     0.6837          -  0.3867
 49  Saborit                         -   0.8100          -     0.6567  0.3667
 50  Saeed & Shirazi            0.7951        -     0.5655          -  0.3402
 51  Radarapu                   0.7242        -     0.4951          -  0.3048
 52  Bennani-Smires             0.9159        -          -          -  0.2290
 53  Gupta                      0.5007        -     0.4044          -  0.2263
 54  Qurdina                    0.9034        -          -          -  0.2259
 55  Aroyehun                   0.5000        -          -          -  0.1250

BASELINES SETUP

  • MAJORITY: The predicted class coincides with the majority class.
  • RANDOM: A random prediction for each instance.
  • CHAR N-GRAMS (see the sketch after this list):
    • BOTS-EN: 500 character 5-grams + Random Forest
    • BOTS-ES: 2,000 character 5-grams + Random Forest
    • GENDER-EN: 2,000 character 4-grams + Random Forest
    • GENDER-ES: 1,000 character 5-grams + Random Forest
  • WORD N-GRAMS:
    • BOTS-EN: 200 word 1-grams + Random Forest
    • BOTS-ES: 100 word 1-grams + Random Forest
    • GENDER-EN: 200 word 1-grams + Random Forest
    • GENDER-ES: 200 word 1-grams + Random Forest
  • WORD EMBEDDINGS: Text represented by averaging the word embeddings.
    • BOTS-EN: glove.twitter.27B.200d + Random Forest
    • BOTS-ES: fasttext-wikipedia + J48
    • GENDER-EN: glove.twitter.27B.100d + SVM
    • GENDER-ES: fasttext-sbwc + SVM
  • LDSE: Low Dimensionality Statistical Embedding described in: Rangel, F., Rosso, P., Franco, M. A Low Dimensionality Representation for Language Variety Identification. In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer-Verlag, LNCS(9624), pp. 156-169, 2018
    • BOTS-EN: LDSE.v2 (MinFreq=10, MinSize=1) + Naive Bayes
    • BOTS-ES: LDSE.v1 (MinFreq=10, MinSize=1) + Naive Bayes
    • GENDER-EN: LDSE.v1 (MinFreq=10, MinSize=3) + BayesNet
    • GENDER-ES: LDSE.v1 (MinFreq=2, MinSize=1) + Naive Bayes
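
As a rough illustration of the character n-gram baselines listed above, a scikit-learn sketch; the feature weighting, classifier settings and any preprocessing are assumptions, not the official baseline configuration.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    def char_ngram_baseline(max_features=500, n=5):
        """Approximation of the BOTS-EN baseline: the top character 5-grams
        fed to a Random Forest; tf-idf weighting and 100 trees are assumptions."""
        return make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(n, n),
                            max_features=max_features),
            RandomForestClassifier(n_estimators=100, random_state=0),
        )

    # Example usage: one string per author (the 100 tweets joined together)
    # and the corresponding "bot"/"human" labels from truth-train.txt.
    # model = char_ngram_baseline()
    # model.fit(train_feeds, train_labels)
    # dev_predictions = model.predict(dev_feeds)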