Profiling Cryptocurrency Influencers with Few-shot Learning

Sponsored by
Symanto Research

Synopsis

  • Task: In this shared task we aim to profile cryptocurrency influencers on social media from a low-resource perspective. We also propose categorizing other related aspects of the influencers, again in a low-resource setting. Specifically, we focus on English Twitter posts in three sub-tasks:
    1. Low-resource influencer profiling (subtask1):
      • Input [data]:
        32 users per label with a maximum of 10 English tweets each.
        Classes: (1) null, (2) nano, (3) micro, (4) macro, (5) mega
      • March 6, 2023 - Version 1.1 New dataset version is available.
      • Official evaluation metric: Macro F1
      • Submission: TIRA.
      • Baselines: User-character Logistic Regression; t5-large (bi-encoders) - zero shot [7], t5-large (label tuning) - few shot [7]
    2. Low-resource influencer interest identification (subtask2):
      • Input [data]:
        64 users per label with 1 English tweet each.
        Classes: (1) technical information, (2) price update, (3) trading matters, (4) gaming, (5) other
      • Official evaluation metric: Macro F1
      • Submission: TIRA.
      • Baselines: User-character Logistic Regression; t5-large (bi-encoders) - zero shot [7], t5-large (label tuning) - few shot [7]
    3. Low-resource influencer intent identification (subtask3):
      • Input [data]:
        64 users per label with 1 English tweet each.
        Classes: (1) subjective opinion, (2) financial information, (3) advertising, (4) announcement
      • Official evaluation metric: Macro F1
      • Submission: TIRA.
      • Baselines: User-character Logistic Regression; t5-large (bi-encoders) - zero shot [7], t5-large (label tuning) - few shot [7]
  • Participation in individual sub-tasks, from both machine learning and deep learning perspectives, is welcome.
  • Task

    Data annotation for Natural Language Processing (NLP) is a challenging task. Aspects such as the economic and temporal cost, the psychological and linguistic expertise required of the annotator, and the inherent subjectivity of the annotation task make it difficult to obtain large amounts of high-quality data [1, 2].
    Cryptocurrencies have massively increased in popularity in recent years [3]. Aspects such as not being reliant on any central authority, the possibilities offered by the different projects, and the new gold rush, spread mainly by influencers, make this a very trendy topic on social media. However, in a real environment where, for instance, traders may want to leverage social media signals to forecast the market, data collection is a challenge and real-time profiling needs to be done in a few milliseconds, which implies processing as little data as possible. Participants will be provided with little training data per task, and will need to choose carefully the models they apply in this low-resource setting. Concepts such as transfer learning [4] and few-shot learning [5,6,7,8] will be key to excelling.

    Award

    tba.

    Data

    Input

    The dataset format is the same for each sub-task. The uncompressed dataset consists of a folder containing two JSON files:

    • A train-truth.json file with the list of authors and their tweets.
      {"twitter user id":"05ca545f2f700d0d5c916657251d010b","texts":[{"text":"I got $20 on Boston winning tonight, who trying to bet? \ud83d\udc40"},{"text":"Is there an alternative search engine besides Google? I hate when I search a question and the answer has absolutely nothing to do with the question"}],"tweet ids":[{"tweet id":"65408feeb147b509e4bc47280c062e16"},{"tweet id":"10a27a0fae34f8411a2ed3b0631db42d"}]}
      {"twitter user id":"062492818c984febba843b650a4a602e","texts":[{"text":"@1inch, my favorite aggregator has a sweet booth this year, if you ignore the glare lol https:\/\/t.co\/BJdG60wpKR"},{"text":"Takes 2 $Matic or 100 $BANK on polygon. That's one of the lowest cost ways to now flex commitment to community, and value the work of your peers in so doing."}],"tweet ids":[{"tweet id":"841834bda2a5703a27a9e2a2e0e11471"},{"tweet id":"6539be1639225a7b362e71dba7dcf18a"}]}
                      
    • A truth.json file with the list of authors and the ground truth.
       
      {"twitter user id":"0003d5772f14b3147659f37b5aa4399e","class":"no influencer"}
      {"twitter user id":"00230caa0289b84a7a077457435d26b8","class":"macro"}
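
    The two files can be joined on the "twitter user id" field. The following is a minimal sketch, assuming each file holds one JSON object per line as in the examples above. The function names and the idea of passing both file paths as arguments are illustrative assumptions, not part of the task definition.

```python
import json


def read_json_lines(path):
    """Read a file with one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_dataset(text_path, truth_path):
    """Join tweets and labels on the "twitter user id" field.

    Returns a list of (user_id, [tweet texts], label) tuples. Users with
    no entry in the truth file get a label of None.
    """
    labels = {
        record["twitter user id"]: record["class"]
        for record in read_json_lines(truth_path)
    }
    dataset = []
    for record in read_json_lines(text_path):
        user_id = record["twitter user id"]
        texts = [entry["text"] for entry in record["texts"]]
        dataset.append((user_id, texts, labels.get(user_id)))
    return dataset
```

    Keeping the label as None when a user is missing from the truth file lets the same loader be reused on unlabeled test data.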
                      

    Output

    Participants' software must take as input the absolute path to an unpacked dataset. The output is a JSON file with one object per line, like this:

    {"twitter user id":"0003d5772f14b3147659f37b5aa4399e","class":"no influencer", "probability": 1.0}
    {"twitter user id":"00230caa0289b84a7a077457435d26b8","class":"macro", "probability": 0.5} 
                    

    The naming of the output file is up to participants. However, we recommend using the sub-task ID, e.g. "subtask1", as the filename and "json" as the extension.
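
    As an illustration of the output format, the sketch below writes one prediction per line using the recommended "subtask1.json" naming. The function name, the output_dir parameter, and the tuple layout of the predictions are assumptions made for this example. How the output location is passed to the software is not specified above.

```python
import json
import os


def write_predictions(predictions, output_dir, subtask_id="subtask1"):
    """Write one JSON object per line to <output_dir>/<subtask_id>.json.

    `predictions` is an iterable of (user_id, label, probability) tuples.
    Returns the path of the file that was written.
    """
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, subtask_id + ".json")
    with open(output_path, "w", encoding="utf-8") as handle:
        for user_id, label, probability in predictions:
            row = {
                "twitter user id": user_id,
                "class": label,
                "probability": float(probability),
            }
            handle.write(json.dumps(row) + "\n")
    return output_path
```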

    Evaluation

    The official evaluation metric is Macro F1. We will also analyse per-class accuracy, precision, recall and F-measure to show participant performance on each class.
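
    For local validation, Macro F1 can be computed as the unweighted mean of the per-class F1 scores. The following is a minimal sketch under that standard definition. It is not the official evaluator, and the handling of classes that never occur (scored as 0.0) is an assumption of this example.

```python
from collections import Counter


def macro_f1(gold, predicted, labels=None):
    """Unweighted mean of per-class F1 over `labels`.

    If `labels` is None, the classes seen in either list are used.
    A class with no true positives, false positives or false negatives
    contributes an F1 of 0.0 (an assumption of this sketch).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if labels is None:
        labels = sorted(set(gold) | set(predicted))
    if not labels:
        raise ValueError("at least one label is required")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1
            false_neg[g] += 1

    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN)
        denominator = 2 * true_pos[label] + false_pos[label] + false_neg[label]
        scores.append(2 * true_pos[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

    Because every class carries equal weight, a rare class that is never predicted correctly lowers the score as much as a frequent one, which matters with only a few dozen users per label.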

    Results SubTask 1: Low-resource influencer profiling

    POS Team Macro-F1

    Results SubTask 2: Low-resource influencer interest identification

    POS Team Macro-F1

    Results SubTask 3: Low-resource influencer intent identification

    POS Team Macro-F1

    Task Committee