Author Obfuscation
2018

This task is divided into author masking and obfuscation evaluation. You can choose to solve one or both of them.

Author Masking

Task

Given a document, paraphrase it so that its writing style no longer matches that of its original author.

Training Corpus

To develop your software, we provide you with a training corpus that consists of documents to be obfuscated, together with further documents by the same author.

Download corpus

Output

For each document to be obfuscated, your software shall output a detailed list of how each piece of the original text has been paraphrased. The output shall be formatted as JSON as follows:

[
  {
    "original": "The quick brown fox jumps over the lazy dog.",
    "original-start-charpos": 10,
    "original-end-charpos": 55,
    "obfuscation": "Lazy lay the dog when an auburn fox quickly jumped over him.",
    "obfuscation-id": 1
  },
  {
    "original": "Squdgy fez, blank jimp crwth vox!",
    "original-start-charpos": 56,
    "original-end-charpos": 70,
    "obfuscation": "A short brimless felt hat barely blocks out the sound of a Celtic violin!",
    "obfuscation-id": 2
  },
  ...
]

The original must be reproduced exactly as it appears in the original document. It may span more than one sentence; however, it should total no more than about 50 words.

The original-start-charpos shall mark the character position in the original document at which the original string starts, and the original-end-charpos the character position at which it ends. Keep in mind that documents may contain multi-byte UTF-8 characters, so positions must be counted in characters, not bytes.
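
For instance, in Python 3, where strings are sequences of Unicode characters, counting by characters comes for free. A minimal sketch; the 0-based, end-inclusive indexing is an assumption, since the task does not pin down a convention:

text = "Käse schmeckt gut."          # example document content
snippet = "schmeckt gut."
start = text.index(snippet)           # character position 5
end = start + len(snippet) - 1        # character position 17 (end-inclusive)
print(len(text), len(text.encode("utf-8")))  # 18 characters, but 19 bytes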

The obfuscation-id must be an integer ranging from 1 to n, where n is the total number of obfuscation entries.

The concatenation of all obfuscation entries, in order of ascending obfuscation-id, shall form a coherent text that obfuscates the original text.

JSON disallows certain special characters in strings (including line breaks). Please make sure to escape these characters in order to output valid JSON. Refer to the JSON spec for further details, or use a JSON library.
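
To illustrate the format (a sketch only, not part of the task requirements), the following Python snippet treats each sentence as one paraphrase unit, tracks character positions, and leaves escaping to the json module. The paraphrase function and file names are placeholders.

import json
import re

def paraphrase(text):
    # Placeholder: a real masking approach would rewrite the text
    # in a different style here.
    return text

def obfuscate_document(document):
    entries = []
    # Treat each sentence-like span as one paraphrase unit; spans longer
    # than about 50 words would have to be split further.
    for obf_id, match in enumerate(re.finditer(r"[^.!?]+[.!?]?", document), start=1):
        if not match.group().strip():
            continue  # skip whitespace-only trailing matches
        entries.append({
            "original": match.group(),
            "original-start-charpos": match.start(),   # 0-based (assumption)
            "original-end-charpos": match.end() - 1,   # end-inclusive (assumption)
            "obfuscation": paraphrase(match.group()),
            "obfuscation-id": obf_id,
        })
    return entries

with open("document.txt", encoding="utf-8") as f:      # hypothetical file name
    records = obfuscate_document(f.read())

with open("document.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)  # handles escaping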

Performance Measures

We call an obfuscation software

  • safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
  • sound, if its obfuscated texts are textually entailed by their originals, and
  • sensible, if its obfuscated texts are inconspicuous.

These dimensions are orthogonal; an obfuscation software may meet any of them to various degrees of perfection.

The performance of an obfuscation software is measured

  • using automatic authorship verifiers to measure safety, and
  • with manual peer-review to assess soundness and sensibleness.

To measure safety we will use the following authorship verification software:

  • Caravel [code] (best-performing approach at PAN 2015)
  • GLAD [code]
  • Authorid [code]
  • AuthorIdentification-PFP [code]

To measure soundness and sensibleness, obfuscations will be sampled and handed out to participants for peer-review.

Finally, we invite all participants to submit automatic performance measures (see the corresponding task Obfuscation Evaluation).

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus will be made available, including the ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/corpus -o path/to/output/directory
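
A minimal wrapper honoring this calling convention could look as follows, reusing obfuscate_document from the sketch above; the per-problem file original.txt is an assumption, so consult the training corpus for the actual layout:

import argparse
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument("-i", required=True, help="path to the corpus")
parser.add_argument("-o", required=True, help="path to the output directory")
args = parser.parse_args()

os.makedirs(args.o, exist_ok=True)
for problem in sorted(os.listdir(args.i)):
    doc_path = os.path.join(args.i, problem, "original.txt")  # assumed layout
    if not os.path.isfile(doc_path):
        continue
    with open(doc_path, encoding="utf-8") as f:
        entries = obfuscate_document(f.read())  # from the sketch above
    out_path = os.path.join(args.o, problem + ".json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)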

You can choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Obfuscation Evaluation

Task

We call an obfuscation software

  • safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
  • sound, if its obfuscated texts are textually entailed by their originals, and
  • sensible, if its obfuscated texts are inconspicuous.

These dimensions are orthogonal; an obfuscation software may meet any of them to various degrees of perfection.

The task is to devise and implement performance measures that quantify one or more of these aspects of an obfuscation software.

Input

We will provide you with the data generated by submitted obfuscation software as soon as it becomes available.

The input format will be the same as the output of the author masking task.

Output

The output of an evaluation software should be formatted as follows:

measure {
  key  : "myMeasure"
  value: "0.567"
}
measure {
  key  : "myOtherMeasure"
  value: "1.5789"
}
measure {
  key  : "myThirdMeasure"
  value: "0.98421"
}
...

The output is formatted as ProtoBuf text, not JSON.

key can be any string that clearly and concisely names the performance measure.

value shall be a numeric quantification of the measure for a given obfuscated text.
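
As a sketch of the expected plumbing: the following snippet reads the JSON files of one masking run and writes a single illustrative measure (a crude length ratio, not a prescribed metric) in the text format above. The output file name is an assumption.

import argparse
import glob
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument("-i", required=True, help="path to the run directory")
parser.add_argument("-o", required=True, help="path to the output directory")
args = parser.parse_args()
os.makedirs(args.o, exist_ok=True)

original_len, obfuscated_len = 0, 0
for path in glob.glob(os.path.join(args.i, "*.json")):
    with open(path, encoding="utf-8") as f:
        for entry in json.load(f):
            original_len += len(entry["original"])
            obfuscated_len += len(entry["obfuscation"])

# Illustrative measure only: how strongly the obfuscation changes text length.
length_ratio = obfuscated_len / max(1, original_len)

with open(os.path.join(args.o, "evaluation.prototext"), "w") as f:  # assumed name
    f.write('measure {\n  key  : "lengthRatio"\n  value: "%.4f"\n}\n' % length_ratio)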

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/run/directory -o path/to/output/directory

You can choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.


Task Chair

Martin Potthast

Bauhaus-Universität Weimar

Task Committee

Matthias Hagen

Bauhaus-Universität Weimar

Benno Stein

Bauhaus-Universität Weimar