Author Masking 2016
Synopsis
- Task: Given a document, paraphrase it so that its writing style no longer matches that of its original author.
- Input: [data]
- Submission: [submit]
Input
To develop your software, we provide you with a training corpus that consists of documents that are to be obfuscated, and other documents from the same author.
Output
For each document to be obfuscated, your software shall output a detailed list of how each piece of original text has been paraphrased. The output shall be formatted as JSON as follows:
[
  {
    "original": "The quick brown fox jumps over the lazy dog.",
    "original-start-charpos": 10,
    "original-end-charpos": 55,
    "obfuscation": "Lazy lay the dog when an auburn fox quickly jumped over him.",
    "obfuscation-id": 1
  },
  {
    "original": "Squdgy fez, blank jimp crwth vox!",
    "original-start-charpos": 56,
    "original-end-charpos": 70,
    "obfuscation": "A short brimless felt hat barely blocks out the sound of a Celtic violin!",
    "obfuscation-id": 2
  },
  ...
]
The "original" must be reproduced exactly as it appears in the original document. It can be longer than a sentence; however, it should be less than about 50 words in total.
The "original-start-charpos" shall mark the character position in the original document where the "original" string starts, and the "original-end-charpos" the character position where the "original" string ends. When calculating the character positions of the originals, keep in mind that some documents may contain multi-byte UTF-8 characters.
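As a minimal sketch of the position calculation, the following Python snippet locates an original string in a document and shows why character offsets and byte offsets can differ. It assumes that positions count Unicode code points in the decoded text and that the end position is exclusive; neither is stated in the task description, and the helper name char_span and the sample document are hypothetical.

```python
def char_span(document: str, original: str, search_from: int = 0) -> tuple[int, int]:
    """Return (start, end) character offsets of `original` within `document`.

    Offsets count Unicode code points of the decoded text, not bytes.
    The end offset is exclusive here, which is an assumption; adjust it
    if an inclusive end position is expected.
    """
    start = document.find(original, search_from)
    if start == -1:
        raise ValueError("original text not found in document")
    return start, start + len(original)


# Hypothetical sample document containing one non-ASCII character.
document = "Café notes. The quick brown fox jumps over the lazy dog."
original = "The quick brown fox jumps over the lazy dog."
start, end = char_span(document, original)

# "é" is one character but two bytes in UTF-8, so the byte offset of the
# span is larger than its character offset.
byte_start = len(document[:start].encode("utf-8"))
```

Reading the input file with an explicit encoding, for example open(path, encoding="utf-8"), keeps the offsets in characters rather than bytes.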
The "obfuscation-id" must be an integer ranging from 1 to n, where n is the number of obfuscations.
The concatenation of all "obfuscation" entries shall form a coherent text that obfuscates the original text. The concatenation will be done in order of ascending "obfuscation-id".
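The ordering rule can be checked before submission with a short sketch like the one below. The function name assemble is hypothetical, and joining entries with a single space is an assumption, since the task description does not name a separator.

```python
def assemble(entries: list[dict]) -> str:
    """Join the obfuscation strings in ascending obfuscation-id order."""
    ids = sorted(entry["obfuscation-id"] for entry in entries)
    # The ids must be exactly 1, 2, ..., n with no gaps or duplicates.
    if ids != list(range(1, len(entries) + 1)):
        raise ValueError("obfuscation-id values must be exactly 1..n")
    ordered = sorted(entries, key=lambda entry: entry["obfuscation-id"])
    # Assumption: entries are joined with a single space.
    return " ".join(entry["obfuscation"] for entry in ordered)


# Entries may be stored in any order; the id decides the final position.
entries = [
    {"obfuscation": "Second sentence.", "obfuscation-id": 2},
    {"obfuscation": "First sentence.", "obfuscation-id": 1},
]
text = assemble(entries)
```

Reading the assembled text end to end is a quick way to see whether the obfuscations still form a coherent document.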
JSON disallows certain special characters in strings (including line breaks). Please make sure to escape these characters in order to output valid JSON. Refer to the JSON spec for further details, or use a JSON library.
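For illustration, Python's standard json module handles the escaping automatically. The entry below is made-up sample data, and its character positions are placeholders rather than values taken from a real document.

```python
import json

# Made-up entry whose strings contain a line break and double quotes.
entries = [
    {
        "original": 'He said:\n"Stop!"',
        "original-start-charpos": 0,
        "original-end-charpos": 16,
        "obfuscation": 'His words were:\n"Halt!"',
        "obfuscation-id": 1,
    }
]

# json.dumps escapes quotes, backslashes, and control characters such as
# line breaks. ensure_ascii=False keeps non-ASCII characters as they are
# instead of converting them to \uXXXX sequences.
serialized = json.dumps(entries, ensure_ascii=False, indent=2)

# Parsing the result again is a cheap validity check.
restored = json.loads(serialized)
```

Building the output with a library rather than string concatenation avoids most sources of invalid JSON.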
Evaluation
We call an obfuscation software
- safe, if a forensic analysis does not reveal the original author of its obfuscated texts,
- sound, if its obfuscated texts are textually entailed by their originals, and
- sensible, if its obfuscated texts are inconspicuous.
These dimensions are orthogonal; an obfuscation software may meet any of them to various degrees of perfection.
The performance of an obfuscation software is measured
- using automatic authorship verifiers to measure safety, and
- with manual peer-review to assess soundness and sensibleness.
To measure safety we will use the following authorship verification software:
- Caravel [code] (best-performing approach at PAN 2015)
- GLAD [code]
- Authorid [code]
- AuthorIdentification-PFP [code]
To measure soundness and sensibleness, obfuscations will be sampled and handed out to participants for peer-review.
Finally, we invite all participants to submit automatic performance measures (see the corresponding task Obfuscation Evaluation).