Plagiarism Detection
2011

This task is divided into external plagiarism detection and intrinsic plagiarism detection. You can choose to solve one or both of them.

Award

We are happy to announce the winners of the 3rd International Competition on Plagiarism Detection. The overall winner will be awarded 500 Euro sponsored by Yahoo! Research:

  • The external plagiarism detection task was won by J. Grman and R. Ravas from SVOP Ltd., Slovakia.
  • The intrinsic plagiarism detection task, and the competition overall, was won by Gabriel Oberreuter, Gastón L'Huillier, Sebastián Ríos, and Juan Velásquez from the Universidad de Chile.
Congratulations!

External Plagiarism Detection

Task
Given a set of suspicious documents and a set of potential source documents, the task is to find all plagiarized passages in the suspicious documents and their corresponding source passages in the source documents.
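
Most external detectors follow a seed-and-extend strategy: word n-grams shared between a suspicious and a source document seed candidate matches, which are then merged into passage-level cases. The Python sketch below illustrates only the seeding step; the n-gram length and data structures are illustrative assumptions, not part of the task definition or any participant's method.

from collections import defaultdict

def ngrams(words, n=5):
    # Yield (position, n-gram) pairs over a token list.
    for i in range(len(words) - n + 1):
        yield i, tuple(words[i:i + n])

def seed_matches(suspicious_text, source_text, n=5):
    # Index the source document's word n-grams, then report every
    # position pair at which the suspicious document repeats one.
    # A full detector would merge adjacent seeds into contiguous
    # passages and report character offsets instead of token positions.
    index = defaultdict(list)
    for j, gram in ngrams(source_text.split(), n):
        index[gram].append(j)
    return [(i, j)
            for i, gram in ngrams(suspicious_text.split(), n)
            for j in index.get(gram, [])]
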
Training Corpus

To develop your approach, we provide you with a training corpus which comprises a set of suspicious documents and a set of source documents. A suspicious document may contain plagiarized passages, the source passages of which may or may not be present in one or more of the source documents.

Learn more » Download corpus

Output

For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an XML file suspicious-documentXYZ.xml which contains meta information about all plagiarism cases detected within it:

<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100"
           source_length="1000"
  />
  ...
</document>

The XML documents must be valid with respect to the XML schema found here.
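
For illustration, the following Python sketch writes such a file using only the standard library. The list of detection dictionaries is an assumed input; the attribute names follow the example above.

import xml.etree.ElementTree as ET

def write_detections(suspicious_name, detections, out_path):
    # `detections` is assumed to be a list of dicts with the keys
    # this_offset, this_length, source_reference, source_offset,
    # and source_length (omit the source_* keys for the intrinsic task).
    root = ET.Element("document", reference=suspicious_name)
    for case in detections:
        attributes = {"name": "detected-plagiarism"}
        attributes.update({key: str(value) for key, value in case.items()})
        ET.SubElement(root, "feature", attributes)
    ET.ElementTree(root).write(out_path, encoding="UTF-8", xml_declaration=True)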

Performance Measures

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.
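
For orientation, the plagdet score is the F1 of macro-averaged precision and recall, discounted by the logarithm of granularity, so that detectors are penalized for reporting a single true case in several pieces. A minimal Python sketch follows; the reference implementation linked below is authoritative.

from math import log2

def plagdet(precision, recall, granularity):
    # plagdet = F1 / log2(1 + granularity); granularity >= 1 counts
    # how many detections cover each true case on average.
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / log2(1 + granularity)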

Learn more » Download measures

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus does not contain ground truth data that reveals whether or not a suspicious document contains any plagiarized passages. To find out how your software performs on the test corpus, you must collect its output and submit it as described below.

After the competition, the test corpus is updated to include the ground truth data. This way, you have all the necessary data to evaluate your approach on your own, without submitting its output, while remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via mail, please upload it to a file hosting service of your choice and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

External Plagiarism Detection Performance
Plagdet  Participant
0.5563   J. Grman and R. Ravas (SVOP Ltd., Slovakia)
0.4153   C. Grozea (Fraunhofer Institute FIRST, Germany) and M. Popescu (University of Bucharest, Romania)
0.3469   G. Oberreuter, G. L'Huillier, S.A. Ríos, and J.D. Velásquez (Universidad de Chile, Chile)
0.2467   N. Cooke, L. Gillam, P. Wrobel, H. Cooke, and F. Al-Obaidli (University of Surrey, United Kingdom)
0.2340   D.A. Rodríguez Torrejón (IES "José Caballero" and Universidad de Huelva, Spain) and J.M. Martín Ramos (Universidad de Huelva, Spain)
0.1991   S. Rao, P. Gupta, K. Singhal, and P. Majumder (DA-IICT, India)
0.1892   Y. Palkovskii, A. Belov, and I. Muzyka (Zhytomyr State University and SkyLine Inc., Ukraine)
0.0804   R.M.A. Nawab, M. Stevenson, and P. Clough (University of Sheffield, United Kingdom)
0.0012   A. Ghosh, P. Bhaskar, S. Pal, and S. Bandyopadhyay (Jadavpur University, India)

A more detailed analysis of the detection performances with respect to precision, recall, and granularity can be found in the overview paper accompanying this task.

Learn more »

Related Work

For an overview of approaches to plagiarism detection, we would like to refer you to the proceedings of the past two plagiarism detection competitions: PAN @ CLEF'10 (overview paper) and PAN @ SEPLN'09 (overview paper).

Intrinsic Plagiarism Detection

Task
Given a set of suspicious documents, the task is to extract all plagiarized passages without comparing them to potential source documents.
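
Intrinsic detectors typically flag passages whose writing style deviates from the rest of the document, for instance by comparing the character n-gram profile of a sliding window against the profile of the whole document. The Python sketch below illustrates this idea; the window size, step, similarity measure, and threshold are illustrative assumptions, not part of the task.

from collections import Counter
from math import sqrt

def char_ngram_profile(text, n=3):
    # Relative frequencies of character n-grams.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {gram: count / total for gram, count in grams.items()}

def style_outliers(document, window=1000, step=500, threshold=0.6):
    # Flag (offset, length) windows whose trigram profile has low
    # cosine similarity to the document-wide profile.
    doc = char_ngram_profile(document)
    doc_norm = sqrt(sum(v * v for v in doc.values()))
    outliers = []
    for start in range(0, max(1, len(document) - window + 1), step):
        win = char_ngram_profile(document[start:start + window])
        win_norm = sqrt(sum(v * v for v in win.values()))
        dot = sum(v * doc.get(g, 0.0) for g, v in win.items())
        if doc_norm and win_norm and dot / (doc_norm * win_norm) < threshold:
            outliers.append((start, window))
    return outliers
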
Training Corpus

To develop your approach, we provide you with a training corpus which comprises a set of suspicious documents, each of which may contain plagiarized passages.

Learn more » Download corpus

Output

For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an XML file suspicious-documentXYZ.xml which contains meta information about all plagiarism cases detected within it:

<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
  />
  ...
</document>

The XML documents must be valid with respect to the XML schema found here.

Performance Measures

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.

Learn more » Download measures

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus does not contain ground truth data that reveals whether or not a suspicious document contains any plagiarized passages. To find out how your software performs on the test corpus, you must collect its output and submit it as described below.

After the competition, the test corpus is updated to include the ground truth data. This way, you have all the necessary data to evaluate your approach on your own, without submitting its output, while remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing the output of your software when run on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via mail, please upload it to a file hosting service of your choice and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

Intrinsic Plagiarism Detection Performance
Plagdet  Participant
0.3255   G. Oberreuter (Universidad de Chile, Chile)
0.1680   M. Kestemont, K. Luyckx, and W. Daelemans (University of Antwerp, Belgium)
0.0841   N. Akiva (Bar Ilan University, Israel)
0.0694   S. Rao, P. Gupta, K. Singhal, and P. Majumder (DA-IICT, India)

A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.

Learn more »

Task Chair

Martin Potthast
Bauhaus-Universität Weimar

Task Committee

Benno Stein
Bauhaus-Universität Weimar

Andreas Eiselt
Bauhaus-Universität Weimar

Alberto Barrón-Cedeño
Universitat Politècnica de València

Paolo Rosso
Universitat Politècnica de València