Author Obfuscation
2016

Sponsor: Adobe

This task is divided into author masking and obfuscation evaluation. You can choose to solve one or both of them.

Author Masking

Task

Given a document, paraphrase it so that its writing style no longer matches that of its original author.

Training Corpus

To develop your software, we provide you with a training corpus consisting of documents that are to be obfuscated, along with other documents by the same authors.

Download corpus

Output

For each document to be obfuscated, your software shall output a detailed list of how each piece of the original text has been paraphrased. The output shall be formatted as JSON as follows:

[
  {
    "original": "The quick brown fox jumps over the lazy dog.",
    "original-start-charpos": 10,
    "original-end-charpos": 55,
    "obfuscation": "Lazy lay the dog when an auburn fox quickly jumped over him.",
    "obfuscation-id": 1
  },
  {
    "original": "Squdgy fez, blank jimp crwth vox!",
    "original-start-charpos": 56,
    "original-end-charpos": 70,
    "obfuscation": "A short brimless felt hat barely blocks out the sound of a Celtic violin!",
    "obfuscation-id": 2
  },
  ...
]

The original must be reproduced exactly as it appears in the original document. It may span more than one sentence; however, it should be no longer than about 50 words in total.

The original-start-charpos shall mark the character position in the original document at which the original string starts, and the original-end-charpos the character position at which it ends. Keep in mind that some documents contain multi-byte UTF-8 characters; positions must be counted in characters, not bytes.
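The distinction between character and byte offsets can be checked quickly in Python, where string indexing operates on characters (code points) rather than bytes. The snippet below uses a hypothetical document fragment:

```python
# Character positions count code points, not bytes.
text = "Füchse über 50 m"      # hypothetical document snippet with multi-byte chars
piece = "über"
start = text.index(piece)       # character offset where the piece starts
end = start + len(piece)        # character offset where the piece ends
assert text[start:end] == piece

# Byte offsets would differ, since "ü" occupies two bytes in UTF-8:
assert len(text.encode("utf-8")) > len(text)
```

If a toolchain reports byte offsets (as some tokenizers do), they must be converted to character offsets before being written to the output.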

The obfuscation-id must be an int ranging from 1 to n, where n is the number of obfuscations.

The concatenation of all obfuscation entries shall form a coherent text that obfuscates the original text. Entries are concatenated in order of ascending obfuscation-id.
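Reassembling the obfuscated text therefore amounts to sorting the entries by obfuscation-id and joining their obfuscation strings. A minimal sketch with hypothetical entries:

```python
# Hypothetical obfuscation entries, deliberately out of order.
entries = [
    {"obfuscation-id": 2, "obfuscation": "second part."},
    {"obfuscation-id": 1, "obfuscation": "First part, "},
]

# Sort on obfuscation-id, then concatenate the obfuscation strings.
obfuscated_text = "".join(
    e["obfuscation"] for e in sorted(entries, key=lambda e: e["obfuscation-id"])
)
print(obfuscated_text)  # -> "First part, second part."
```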

JSON disallows certain special characters in strings (including line breaks). Please make sure to escape these characters in order to output valid JSON. Refer to the JSON spec for further details, or use a JSON library.
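Using a JSON library handles this escaping automatically. A minimal sketch with hypothetical record contents and a hypothetical output filename, using Python's standard json module:

```python
import json

# Hypothetical obfuscation records; positions refer to the original document.
records = [
    {
        "original": "Line one.\nLine two.",   # contains a line break
        "original-start-charpos": 0,
        "original-end-charpos": 19,
        "obfuscation": "First line; second line.",
        "obfuscation-id": 1,
    },
]

# json.dump escapes special characters (line breaks, quotes, ...) so the
# serialized output is valid JSON.
with open("document.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

With ensure_ascii=False, non-ASCII characters are written as UTF-8 rather than \uXXXX escapes; both forms are valid JSON.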

Performance Measures

We call an obfuscation software

  • safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
  • sound, if its obfuscated texts are textually entailed with their originals, and
  • sensible, if its obfuscated texts are inconspicuous.

These dimensions are orthogonal; an obfuscation software may meet any of them to various degrees of perfection.

The performance of an obfuscation software is measured

  • using automatic authorship verifiers to measure safety, and
  • with manual peer-review to assess soundness and sensibleness.

To measure safety we will use the following authorship verification software:

  • Caravel [code] (best-performing approach at PAN 2015)
  • GLAD [code]
  • Authorid [code]
  • AuthorIdentification-PFP [code]

To measure soundness and sensibleness, obfuscations will be sampled and handed out to participants for peer-review.

Finally, we invite all participants to submit automatic performance measures (see the corresponding task Obfuscation Evaluation).

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/corpus -o path/to/output/directory
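In Python, this interface can be sketched with the standard argparse module; the argument list is hard-coded below only to illustrate the expected invocation:

```python
import argparse

# Sketch of the required command line interface (-i for input, -o for output).
parser = argparse.ArgumentParser(description="PAN author masking software")
parser.add_argument("-i", dest="corpus", required=True,
                    help="path to the input corpus")
parser.add_argument("-o", dest="output", required=True,
                    help="path to the output directory")

# In production, call parser.parse_args() with no arguments to read sys.argv;
# here we pass the example invocation explicitly for illustration.
args = parser.parse_args(["-i", "path/to/corpus", "-o", "path/to/output"])
```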

You may choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Obfuscation Evaluation

Task

We call an obfuscation software

  • safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
  • sound, if its obfuscated texts are textually entailed with their originals, and
  • sensible, if its obfuscated texts are inconspicuous.

These dimensions are orthogonal; an obfuscation software may meet any of them to various degrees of perfection.

The task is to devise and implement performance measures that quantify any or parts of these aspects of an obfuscation software.

Input

We will provide you with the data generated by submitted obfuscation software as soon as it becomes available.

The input format will be the same as the output of the author masking task.

Output

The output of an evaluation software should be formatted as follows:

measure {
  key  : "myMeasure"
  value: "0.567"
}
measure {
  key  : "myOtherMeasure"
  value: "1.5789"
}
measure {
  key  : "myThirdMeasure"
  value: "0.98421"
}
...

The output is formatted as ProtoBuf text, not JSON.

key can be any string that clearly and concisely names the performance measure.

value shall be a numeric quantification of the measure for a given obfuscated text.
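The measure blocks can be emitted with simple string formatting; no ProtoBuf library is required for the text format. A minimal sketch with hypothetical measure names:

```python
# Sketch: emit evaluation results as ProtoBuf text-format "measure" blocks.
def format_measure(key, value):
    # One "measure { ... }" block per performance measure.
    return 'measure {\n  key  : "%s"\n  value: "%s"\n}' % (key, value)

# Hypothetical measure names and values.
results = {"myMeasure": 0.567, "myOtherMeasure": 1.5789}
output = "\n".join(format_measure(k, v) for k, v in results.items())
print(output)
```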

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/run/directory -o path/to/output/directory

You may choose freely among the available programming languages and between the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via ssh and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:

PAN Virtual Machine User Guide »

Once deployed in your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.

Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with a third party or use it for other purposes than the PAN competition.

Task Chair

Martin Potthast

Bauhaus-Universität Weimar

Task Committee

Matthias Hagen

Bauhaus-Universität Weimar

Benno Stein

Bauhaus-Universität Weimar