Author Masking 2017

Task

Given a document, paraphrase it so that its writing style no longer matches that of its original author.

Input

To develop your software, we provide you with a training corpus that consists of documents to be obfuscated and other documents by the same author.

Download corpus

Output

For each document to be obfuscated, your software shall output a detailed list of how each piece of original text has been paraphrased. The output shall be formatted using JSON as follows:

[
  {
    "original": "The quick brown fox jumps over the lazy dog.",
    "original-start-charpos": 10,
    "original-end-charpos": 55,
    "obfuscation": "Lazy lay the dog when an auburn fox quickly jumped over him.",
    "obfuscation-id": 1
  },
  {
    "original": "Squdgy fez, blank jimp crwth vox!",
    "original-start-charpos": 56,
    "original-end-charpos": 70,
    "obfuscation": "A short brimless felt hat barely blocks out the sound of a Celtic violin!",
    "obfuscation-id": 2
  },
  ...
]

The original must be reproduced exactly as it appears in the original document. It may be longer than a sentence, but it should be shorter than about 50 words in total.

The original-start-charpos shall mark the character position in the original document where the original string starts, and the original-end-charpos the character position where the original ends. When calculating the character positions of the originals, keep in mind that some documents may contain multi-byte UTF-8 characters, so character positions and byte positions can differ.
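The difference between character and byte positions can be illustrated with a minimal Python sketch. The helper name locate is hypothetical, and the sketch assumes zero-based offsets with an exclusive end position (as in Python slicing); the task description does not state which convention is expected, so adjust accordingly.

```python
def locate(document: str, original: str, search_from: int = 0) -> tuple[int, int]:
    """Return (start, end) character offsets of `original` in `document`.

    Assumption (not stated in the task description): offsets are zero-based
    and the end is exclusive, so that document[start:end] == original.
    """
    start = document.find(original, search_from)
    if start < 0:
        raise ValueError("original text not found in document")
    return start, start + len(original)


# Work on decoded text (str), not on raw bytes: after the first multi-byte
# character, byte offsets no longer agree with character offsets.
document = "Caf\u00e9 talk. The quick brown fox jumps over the lazy dog."
original = "The quick brown fox jumps over the lazy dog."

start, end = locate(document, original)
byte_start = document.encode("utf-8").find(original.encode("utf-8"))

print(start, end)    # character offsets
print(byte_start)    # byte offset, shifted by the two-byte character
```

Reading each file with open(path, encoding="utf-8") before searching keeps all positions in characters rather than bytes.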

The obfuscation-id must be an integer ranging from 1 to n, where n is the number of obfuscations.

The concatenation of all obfuscation entries shall form a coherent text that obfuscates the original text. The concatenation will be done in order of ascending obfuscation-id.
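To check locally what the concatenated text will look like, something along the following lines may help. This is only a sketch: the function name assemble is hypothetical, and the separator placed between entries is an assumption, since the task description does not say how the pieces are joined.

```python
def assemble(entries: list[dict], separator: str = " ") -> str:
    """Join the "obfuscation" strings in ascending "obfuscation-id" order.

    Assumption: entries are joined with `separator`; the separator actually
    used during evaluation is not specified in the task description.
    """
    ordered = sorted(entries, key=lambda e: e["obfuscation-id"])
    ids = [e["obfuscation-id"] for e in ordered]
    # The ids must be exactly 1, 2, ..., n with no gaps or duplicates.
    if ids != list(range(1, len(ordered) + 1)):
        raise ValueError("obfuscation-id values must be exactly 1..n")
    return separator.join(e["obfuscation"] for e in ordered)


entries = [
    {
        "obfuscation": "A short brimless felt hat barely blocks out the sound of a Celtic violin!",
        "obfuscation-id": 2,
    },
    {
        "obfuscation": "Lazy lay the dog when an auburn fox quickly jumped over him.",
        "obfuscation-id": 1,
    },
]

text = assemble(entries)
print(text)
```

Sorting before joining means the entries do not have to be stored in order, and the id check catches gaps or duplicates before submission.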

JSON disallows certain special characters in strings (including line breaks). Please make sure to escape these characters in order to output valid JSON. Refer to the JSON spec for further details, or use a JSON library.
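As an illustration of the library route, Python's standard json module handles the escaping automatically. The entry below is made-up sample data, and its character positions are placeholders rather than values from a real document.

```python
import json

# Made-up sample entry containing a line break, a tab, and double quotes.
entries = [
    {
        "original": 'She said:\n"Hello"\tand left.',
        "original-start-charpos": 0,   # placeholder value
        "original-end-charpos": 27,    # placeholder value
        "obfuscation": 'Having said "Hello",\nshe left.',
        "obfuscation-id": 1,
    }
]

# json.dumps escapes quotes, backslashes, and control characters such as
# line breaks and tabs. ensure_ascii=False keeps non-ASCII characters as-is.
serialized = json.dumps(entries, ensure_ascii=False, indent=2)
print(serialized)

# Round trip: parsing the serialized text yields the same data again.
restored = json.loads(serialized)
```

To write the result to disk, json.dump(entries, f, ensure_ascii=False) with a file opened using encoding="utf-8" works the same way.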

Evaluation

We call an obfuscation software

  • safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
  • sound, if its obfuscated texts are textually entailed by their originals, and
  • sensible, if its obfuscated texts are inconspicuous.

These dimensions are orthogonal; an obfuscation software may meet any of them to various degrees of perfection.

The performance of an obfuscation software is measured

  • using automatic authorship verifiers to measure safety, and
  • with manual peer-review to assess soundness and sensibleness.

To measure safety we will use the following authorship verification software:

  • Caravel [code] (best-performing approach at PAN 2015)
  • GLAD [code]
  • Authorid [code]
  • AuthorIdentification-PFP [code]

To measure soundness and sensibleness, obfuscations will be sampled and handed out to participants for peer-review.

Finally, we invite all participants to submit automatic performance measures (see the corresponding task Obfuscation Evaluation).

Task Committee