Celebrity Profiling 2019
Synopsis
- Task: Given a celebrity's Twitter feed, determine its owner's age, fame, gender, and occupation.
- Input: [data]
- Output: [verifier]
- Evaluation: [code]
- Submission: [submit]
- Baselines: none.
Task
Celebrities are among the most prolific users of social media, promoting their personas and rallying followers. Because this activity produces genuine writing samples, celebrities are worthwhile research subjects in many respects, not least for author profiling.
The Celebrity Profiling task this year is to predict four traits of a celebrity from their social media communication. The traits are the degree of fame, occupation, age, and gender. The social media communication is given as the teaser messages from past tweets. The goal is to develop a piece of software which predicts celebrity traits from the teaser history.
Total dataset size | 48,335 user profiles |
---|---|
Text size | 2,181 tweets avg. per user |
Novel traits | Fame and occupation |
New attributes | Detailed birth years and nonbinary gender |
Data
The training dataset consists of two files: feeds.ndjson as input and labels.ndjson as output. Each file lists all celebrities as JSON objects, one per line, identified by the id key.
Input Format
The input file contains the id and a list of all teaser messages for each celebrity.
{"id": 1234, "text": ["a tweet", "another tweet", ...]}
{"id": 5678, "text": ["a tweet", "another tweet", ...]}
...
Output Format
The output file contains the id and a value for each trait for each celebrity from the input file.
{"id": 1234, "fame": "star", "occupation": "sports", "gender": "female", "birthyear": 2002}
{"id": 5678, "fame": "rising", "occupation": "professional", "gender": "male", "birthyear": 1990}
...
fame := {rising, star, superstar}
occupation := {sports, performer, creator, politics, manager,
science, professional, religious}
birthyear := {1940, ..., 2012}
gender := {male, female, nonbinary}
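The label domains above can be checked mechanically before submitting. The following is a minimal sketch, not part of the official tooling: the function name `validate_label_line` and the choice to raise `ValueError` are assumptions made for illustration.

```python
import json

# Allowed values, copied from the trait definitions above.
FAME = {"rising", "star", "superstar"}
OCCUPATION = {"sports", "performer", "creator", "politics", "manager",
              "science", "professional", "religious"}
GENDER = {"male", "female", "nonbinary"}
BIRTHYEAR = range(1940, 2013)  # 1940 through 2012, inclusive

CHECKS = {"fame": FAME, "occupation": OCCUPATION,
          "gender": GENDER, "birthyear": BIRTHYEAR}


def validate_label_line(line: str) -> dict:
    """Parse one line of labels.ndjson; raise ValueError on an invalid record.

    Hypothetical helper for illustration only.
    """
    record = json.loads(line)
    if "id" not in record:
        raise ValueError("missing 'id' key")
    for key, allowed in CHECKS.items():
        if record.get(key) not in allowed:
            raise ValueError(f"invalid {key!r}: {record.get(key)!r}")
    return record
```

Running each line of a generated labels.ndjson through this check catches misspelled class names or out-of-range birth years before evaluation.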
Evaluation
Submissions are judged by a combined metric, cRank, which is the harmonic mean of the F1 scores of the four traits.
$$ \text{cRank} = {4 \over {\frac{1}{\text{F}_{1, \text{fame}}} + \frac{1}{\text{F}_{1,
\text{occupation}}} + \frac{1}{\text{F}_{1, \text{gender}}} +
\frac{1}{\text{F}_{1, \text{age}}}}} $$
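As a rough illustration of the formula, the harmonic mean can be computed as below. This is a sketch rather than the official evaluation code. The function name `c_rank` is hypothetical, and returning 0.0 when any per-trait F1 is zero is an assumption, since the formula itself is undefined in that case.

```python
def c_rank(f1_fame: float, f1_occupation: float,
           f1_gender: float, f1_age: float) -> float:
    """Harmonic mean of the four per-trait F1 scores."""
    scores = [f1_fame, f1_occupation, f1_gender, f1_age]
    # Assumption: a zero F1 makes the harmonic mean undefined; report 0.0.
    if any(score <= 0 for score in scores):
        return 0.0
    return len(scores) / sum(1.0 / score for score in scores)
```

The harmonic mean is dominated by the weakest trait, so a system cannot compensate for one poorly predicted trait with strong results on the others. Applied to the rounded per-trait scores in the results table, the function gives values close to the reported cRank; small differences are expected because the table entries are rounded.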
All traits are judged by their respective F1. Precision and recall of birthyear are calculated leniently. If a prediction is within an m-window of the truth, it is counted as correct:
$$ \text{true birthyear} - m \le \text{predicted birthyear} \le \text{true birthyear} + m$$
The window size m is based on the birth year and increases linearly from about 2 years for 2012 to about 9 years for 1940.
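The lenient birth year check might look like the sketch below. The exact coefficients and any rounding used by the official evaluation code are not given here, so this example assumes a straight line through the two stated endpoints (m = 2 for 2012 and m = 9 for 1940). The names `window_size` and `birthyear_is_correct` are hypothetical.

```python
def window_size(true_year: int, m_2012: float = 2.0, m_1940: float = 9.0) -> float:
    """Window size m for a true birth year.

    Assumption: linear interpolation between the approximate endpoints
    stated in the text; the official coefficients may differ slightly.
    """
    slope = (m_1940 - m_2012) / (2012 - 1940)  # growth of m per year of age
    return m_2012 + slope * (2012 - true_year)


def birthyear_is_correct(predicted: int, true_year: int) -> bool:
    """True if the prediction lies within the m-window around the truth."""
    m = window_size(true_year)
    return true_year - m <= predicted <= true_year + m
```

Under these assumptions, a celebrity born in 1990 has a window of a little over 4 years, so a prediction of 1992 counts as correct, while a prediction of 2009 for someone born in 2012 does not.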
Submission
For evaluation, your software will read a feeds.ndjson file from a given directory and write a valid labels.ndjson with your predictions to a given output directory.
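A submission skeleton following this read-and-write contract could be sketched as follows. The command-line flags `-i` and `-o`, the function names, and the constant placeholder predictor are all assumptions made for illustration; the actual invocation is defined by the submission system.

```python
import argparse
import json
from pathlib import Path


def predict(tweets: list) -> dict:
    """Placeholder predictor (hypothetical): same labels for every feed."""
    return {"fame": "star", "occupation": "performer",
            "gender": "male", "birthyear": 1980}


def run(input_dir: str, output_dir: str) -> None:
    """Read feeds.ndjson from input_dir and write labels.ndjson to output_dir."""
    in_path = Path(input_dir) / "feeds.ndjson"
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "labels.ndjson"
    with in_path.open(encoding="utf-8") as feeds, \
            out_path.open("w", encoding="utf-8") as labels:
        for line in feeds:
            if not line.strip():
                continue  # tolerate blank lines
            feed = json.loads(line)
            record = {"id": feed["id"], **predict(feed["text"])}
            labels.write(json.dumps(record) + "\n")


def main(argv=None) -> None:
    # Flag names are assumptions, not prescribed by the task.
    parser = argparse.ArgumentParser(description="Celebrity profiling sketch")
    parser.add_argument("-i", "--input-dir", required=True)
    parser.add_argument("-o", "--output-dir", required=True)
    args = parser.parse_args(argv)
    run(args.input_dir, args.output_dir)

# Call main() from your entry point, e.g. under an
# `if __name__ == "__main__":` guard.
```

Processing the feed file line by line keeps memory use low, which matters because each celebrity's feed contains over two thousand tweets on average.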
Results
team | test-dataset1 cRank | test-dataset1 gender | test-dataset1 age | test-dataset1 fame | test-dataset1 occup | test-dataset2 cRank | test-dataset2 gender | test-dataset2 age | test-dataset2 fame | test-dataset2 occup |
---|---|---|---|---|---|---|---|---|---|---|
radivchev | 0.593 | 0.726 | 0.618 | 0.551 | 0.515 | 0.559 | 0.609 | 0.657 | 0.548 | 0.461 |
morenosandoval | 0.541 | 0.644 | 0.518 | 0.563 | 0.469 | 0.497 | 0.561 | 0.516 | 0.518 | 0.418 |
martinc | 0.462 | 0.580 | 0.361 | 0.517 | 0.449 | 0.465 | 0.594 | 0.347 | 0.507 | 0.486 |
fernquist | 0.424 | 0.447 | 0.339 | 0.493 | 0.449 | 0.413 | 0.465 | 0.467 | 0.482 | 0.300 |
petrik | 0.377 | 0.595 | 0.255 | 0.480 | 0.340 | 0.441 | 0.555 | 0.360 | 0.526 | 0.385 |
asif | - | - | - | - | - | 0.402 | 0.588 | 0.254 | 0.504 | 0.427 |
pelzer | 0.331 | 0.244 | 0.418 | 0 | 0.178 | - | - | - | - | - |
bryan | - | - | - | - | - | 0.231 | 0.335 | 0.207 | 0.289 | 0.165 |
baseline-rand | 0.223 | 0.344 | 0.123 | 0.341 | 0.125 | - | - | - | - | - |
baseline-uniform | 0.138 | 0.266 | 0.117 | 0.099 | 0.152 | - | - | - | - | - |
baseline-majority | 0.136 | 0.278 | 0.071 | 0.285 | 0.121 | - | - | - | - | - |
Related Work
- Matti Wiegmann, Benno Stein, Martin Potthast. Celebrity Profiling. To appear in 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), July 2019. Association for Computational Linguistics.