Plagiarism Detection
2014

This task is divided into source retrieval and text alignment. You can choose to solve one or both of them.

Source Retrieval

Task
Given a suspicious document and a web search API, your task is to retrieve all plagiarized sources while minimizing retrieval costs.
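A typical first step is to derive queries from the suspicious document itself. The following minimal sketch (in Python, like the baseline below) extracts frequent content words as candidate keyword queries; the tokenization, stopword list, and query length are illustrative assumptions, not part of the task specification.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "for"}

def candidate_queries(text, num_queries=5, words_per_query=4):
    # Illustrative heuristic: rank content words by frequency and
    # group them into fixed-length keyword queries.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    keywords = [w for w, _ in counts.most_common(num_queries * words_per_query)]
    return [" ".join(keywords[i:i + words_per_query])
            for i in range(0, len(keywords), words_per_query)]

with open("suspicious-documentXYZ.txt", encoding="utf-8") as f:
    for query in candidate_queries(f.read()):
        print(query)
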
Training Corpus

To develop your software, we provide you with a training corpus that consists of suspicious documents. Each suspicious document is about a specific topic and may consist of plagiarized passages obtained from web pages on that topic found in the ClueWeb09 corpus.

Learn more » Download corpus

API

If you are not in possession of the ClueWeb09 corpus, we also provide access to two search engines which index the ClueWeb, namely the Lemur Indri search engine and the ChatNoir search engine. To programmatically access these two search engines, we provide a unified search API.

Learn more »

Note: To better separate the source retrieval task from the text alignment task, the API provides a text alignment oracle feature. For each document you request to download from the ClueWeb, the text alignment oracle discloses whether this document is a source of plagiarism for the suspicious document in question. In addition, the plagiarized text is returned. This way, participation in the source retrieval task does not require the development of a text alignment solution. However, you are free to use your own text alignment if you want to.
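As a rough illustration only, a download with an oracle response might be consumed as sketched below. The parameter and field names here are placeholders, not the real schema; consult the API documentation linked above for the actual request format and response fields.

import requests

PROXY = "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/"

def download_with_oracle(clueweb_id, access_token, suspicious_docid):
    # Placeholder parameter names; the real API may differ.
    response = requests.get(PROXY + str(clueweb_id),
                            params={"token": access_token,
                                    "suspicious-docid": suspicious_docid})
    result = response.json()  # assumed JSON response; check the API docs
    if result.get("oracle") == "source":        # hypothetical field
        print("Source found:", result.get("plagiarism"))  # hypothetical field
    return result
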

Baseline

For your convenience, we provide a baseline program written in Python.

Download program

The program loops through the suspicious documents in a given directory and outputs a search interaction log. The log is valid with respect to the output format described below. You may use the source code for getting started with your own approach.

Output

For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an interaction log suspicious-documentXYZ.log which records meta information about your retrieval process:

Timestamp   [Query|Download_URL]
1358326592  barack obama family tree
1358326597  http://webis15.medien.uni-weimar.de/proxy/clueweb/id/110212744
1358326598  http://webis15.medien.uni-weimar.de/proxy/clueweb/id/10221241
1358326599  http://webis15.medien.uni-weimar.de/proxy/clueweb/id/100003305377
1358326605  barack obama genealogy
1358326610  http://webis15.medien.uni-weimar.de/proxy/clueweb/id/82208332
...

For example, the above file would specify that at 1358326592 (a Unix timestamp) the query barack obama family tree was sent, and that afterwards three of the retrieved documents were selected for download before the next query was sent.
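Producing such a log requires no more than appending one line per event. A minimal sketch, assuming a tab between timestamp and entry (the format above only shows whitespace separation, so verify against the format specification):

import time

def log_event(log_file, entry):
    # One line per event: Unix timestamp, whitespace, then either the
    # query string or the URL of the downloaded document.
    log_file.write("%d\t%s\n" % (int(time.time()), entry))

with open("suspicious-documentXYZ.log", "w", encoding="utf-8") as log:
    log_event(log, "barack obama family tree")
    log_event(log, "http://webis15.medien.uni-weimar.de/proxy/clueweb/id/110212744")
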

Performance Measures

Performance will be measured based on the following five scores, each averaged over all suspicious documents:

  1. Number of queries submitted.
  2. Number of web pages downloaded.
  3. Precision and recall of web pages downloaded regarding actual sources of a suspicious document.
  4. Number of queries until the first actual source is found.
  5. Number of downloads until the first actual source is downloaded.

Measures 1-3 capture the overall behavior of a system, while measures 4-5 assess the time to first result. The quality of identifying reused passages between documents is not taken into account here. Note, however, that retrieving a duplicate of a source document is considered a true positive, whereas retrieving more than one duplicate of the same source document does not improve performance.

Learn more »
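To sanity-check a system during development, these scores can be approximated from a parsed log. A simplified sketch follows; the official evaluation additionally maps duplicate documents onto their sources, which is omitted here.

def evaluate_log(events, true_sources):
    # events: list of ("query", text) and ("download", url) pairs in
    # log order; true_sources: set of URLs that are actual sources.
    queries = [v for k, v in events if k == "query"]
    downloads = [v for k, v in events if k == "download"]
    hits = set(d for d in downloads if d in true_sources)
    precision = len(hits) / len(downloads) if downloads else 0.0
    recall = len(hits) / len(true_sources) if true_sources else 0.0
    queries_to_first = downloads_to_first = None
    q = d = 0
    for kind, value in events:
        if kind == "query":
            q += 1
        else:
            d += 1
            if value in true_sources:
                queries_to_first, downloads_to_first = q, d
                break
    return (len(queries), len(downloads), precision, recall,
            queries_to_first, downloads_to_first)
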

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus is made available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/corpus -o path/to/output/directory -t accessToken

You can choose freely among the available programming languages and among the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below.

PAN Virtual Machine User Guide »

Once your software is deployed in your virtual machine, you can proceed to submit it. Before doing so, please use the software submission readiness tester we provide to verify that your software works. Since we will call your software automatically in much the same way as the tester does, this lowers the risk of errors.

Download PAN Software Submission Readiness Tester

When your software is submission-ready, please mail the filled-out submission.txt file, found alongside the software submission readiness tester, to pan@webis.de.

Note: By submitting your software you retain full copyright. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with third parties or use it for purposes other than the PAN competition.

Results

The following table lists the performances achieved by the participating teams:

Source Retrieval Performance

Workload to 1st Detection    Downloaded Sources
Queries    Downloads         Precision    Recall    Team
 54.5        33.2              0.40        0.39     Victoria Elizalde (Private, Argentina)
 83.5       207.1              0.08        0.48     Leilei Kong, Yong Han, Zhongyuan Han, Haihao Yu, Qibo Wang, Tinglei Zhang, and Haoliang Qi (Heilongjiang Institute of Technology, China)
 60.0        38.8              0.38        0.51     Amit Prakash and Sujan Kumar Saha (Birla Institute of Technology, India)
 19.5       237.3              0.08        0.40     Šimon Suchomel and Michal Brandejs (Masaryk University, Czech Republic)
117.1        14.4              0.57        0.48     Kyle Williams, Hung-Hsuan Chen, and C. Lee Giles (Pennsylvania State University, USA)
 37.0        18.6              0.54        0.45     Denis Zubarev° and Ilya Sochenkov* (°Institute for Systems Analysis of Russian Academy of Sciences and *Peoples’ Friendship University of Russia, Russia)

A more detailed analysis of the retrieval performances can be found in the overview paper accompanying this task.

Learn more »

Related Work

This task has been run since PAN'12; see the proceedings and overview papers of the respective previous editions.

Text Alignment

Task
Given a pair of documents, your task is to identify all contiguous maximal-length passages of reused text between them.
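Most approaches to this task follow a seed-and-extend strategy: find small exact matches between the two documents and merge nearby matches into passages. A minimal sketch of the seeding step, with illustrative choices of tokenization and n-gram length:

import re
from collections import defaultdict

def ngram_seeds(susp_text, src_text, n=5):
    # Return (suspicious_offset, source_offset) pairs of matching word
    # n-grams; offsets are character positions, as the output format
    # described below requires.
    def grams(text):
        tokens = [(m.group().lower(), m.start())
                  for m in re.finditer(r"\w+", text)]
        return [(" ".join(t for t, _ in tokens[i:i + n]), tokens[i][1])
                for i in range(len(tokens) - n + 1)]
    index = defaultdict(list)
    for gram, offset in grams(src_text):
        index[gram].append(offset)
    return [(susp_off, src_off)
            for gram, susp_off in grams(susp_text)
            for src_off in index.get(gram, [])]
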
Training Corpus

To develop your software, we provide you with a training corpus that consists of pairs of documents, one of which may contain passages of text reused from the other. The reused text is subject to various kinds of (automatic) obfuscation to hide the fact that it has been reused.

Learn more » Download corpus

Baseline

For your convenience, we provide a baseline program written in Python.

Download program

The program loops through the document pairs of a corpus and records the detection results in XML files. The XML files are valid with respect to the output format described below. You may use the source code for getting started with your own approach.

Output

Each evaluation corpus contains a file named pairs, which lists all pairs of suspicious documents and source documents to be compared. For each pair suspicious-documentXYZ.txt and source-documentABC.txt, your plagiarism detector shall output an XML file suspicious-documentXYZ-source-documentABC.xml which contains meta information about the plagiarism cases detected within:

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

For example, the above file would specify an aligned passage of text between suspicious-documentXYZ.txt and source-documentABC.txt that is 1000 characters long, starting at character offset 5 in the suspicious document and at character offset 100 in the source document.
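This format can be written with the standard library's xml.etree.ElementTree; a minimal sketch matching the example above:

import xml.etree.ElementTree as ET

def write_detections(susp, src, cases, out_path):
    # cases: (this_offset, this_length, source_offset, source_length) tuples.
    root = ET.Element("document", reference=susp)
    for this_off, this_len, src_off, src_len in cases:
        ET.SubElement(root, "feature",
                      name="detected-plagiarism",
                      this_offset=str(this_off), this_length=str(this_len),
                      source_reference=src,
                      source_offset=str(src_off), source_length=str(src_len))
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

write_detections("suspicious-documentXYZ.txt", "source-documentABC.txt",
                 [(5, 1000, 100, 1000)],
                 "suspicious-documentXYZ-source-documentABC.xml")
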

Performance Measures

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.

Learn more » Download measures
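For orientation, plagdet combines the other three measures by discounting the F1 score with the granularity; the sketch below mirrors the published definition, but use the downloadable reference implementation for official numbers.

import math

def plagdet(precision, recall, granularity):
    # plagdet = F1 / log2(1 + granularity); a granularity of 1 (each
    # case detected exactly once) leaves the F1 score unchanged.
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
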

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.

After the competition, the test corpus is made available, including ground truth data. This way, you have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.

Download corpus 1 Download corpus 2 Download corpus 3

Submission

We ask you to prepare your software so that it can be executed via a command line call.

> mySoftware -i path/to/corpus -o path/to/output/directory

You can choose freely among the available programming languages and among the operating systems Microsoft Windows 7 and Ubuntu 12.04. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below.

PAN Virtual Machine User Guide »

Once your software is deployed in your virtual machine, you can proceed to submit it. Before doing so, please use the software submission readiness tester we provide to verify that your software works. Since we will call your software automatically in much the same way as the tester does, this lowers the risk of errors.

Download PAN Software Submission Readiness Tester

When your software is submission-ready, please mail the filled-out submission.txt file, found alongside the software submission readiness tester, to pan@webis.de.

Note: By submitting your software you retain full copyright. You agree to grant us usage rights only for the purpose of the PAN competition. We agree not to share your software with third parties or use it for purposes other than the PAN competition.

Results

The following table lists the performances achieved by the participating teams:

Text Alignment Performance

Plagdet    Team
0.87818    Miguel A. Sanchez-Perez, Grigori Sidorov, and Alexander Gelbukh (Instituto Politécnico Nacional, Mexico)
0.86933    Gabriel Oberreuter and Andreas Eiselt (Innovand.io, Chile)
0.86806    Yurii Palkovskii and Alexei Belov (Zhytomyr Ivan Franko State University, Ukraine)
0.85930    Demetrios Glinos (University of Central Florida, USA)
0.84404    Prasha Shrestha, Suraj Maharjan, and Thamar Solorio (University of Alabama at Birmingham, USA)
0.82952    Diego Antonio Rodríguez Torrejón and José Manuel Martín Ramos (Universidad de Huelva, Spain)
0.82642    Philipp Gross and Pashutan Modaresi (pressrelations GmbH, Germany)
0.82161    Leilei Kong, Yong Han, Zhongyuan Han, Haihao Yu, Qibo Wang, Tinglei Zhang, and Haoliang Qi (Heilongjiang Institute of Technology, China)
0.67220    Samira Abnar, Mostafa Dehghani, Hamed Zamani, and Azadeh Shakery (University of Tehran, Iran)
0.65954    Faisal Alvi°, Mark Stevenson*, and Paul Clough* (°King Fahd University of Petroleum & Minerals, Saudi Arabia, and *University of Sheffield, UK)
0.42191    Baseline
0.28302    Lee Gillam and Scott Notley (University of Surrey, UK)

A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.

Learn more »

Related Work

This task has been run since PAN'09; see the proceedings and overview papers of the respective previous editions.

Task Chair

Martin Potthast
Bauhaus-Universität Weimar

Task Committee

Tim Gollub
Bauhaus-Universität Weimar

Matthias Hagen
Bauhaus-Universität Weimar

Benno Stein
Bauhaus-Universität Weimar

Paolo Rosso
Universitat Politècnica de València