Plagiarism Detection
2009

Detecting plagiarism by hand is a laborious retrieval task, one that can be aided or even automated. The PAN competition on plagiarism detection aims to foster the development of new solutions in this respect.

Award

We are happy to announce the following winners of the 1st International Competition on Plagiarism Detection; the overall winner will be awarded 500 Euro sponsored by Yahoo! Research:

  • The winners of the external plagiarism detection task, and the overall winners, are Cristian Grozea, Christian Gehl, and Marius Popescu from Fraunhofer FIRST and the University of Bucharest.
  • The winner of the intrinsic plagiarism detection task is Efstathios Stamatatos from the University of the Aegean.
Congratulations!

External Plagiarism Detection

Task

Given a set of suspicious documents and a set of source documents, the task is to find all text passages in the suspicious documents that have been plagiarized, along with the corresponding text passages in the source documents.
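
To illustrate one common family of approaches (a sketch, not a prescribed method), the following Python fragment fingerprints the source documents with hashed word 5-grams and reports shared fingerprints as candidate matches. All names, the n-gram length, and the in-memory corpus layout are illustrative assumptions.

import re
from collections import defaultdict

def ngrams(text, n=5):
    # Yield (character offset, hashed word n-gram) pairs over a text.
    tokens = [(m.start(), m.group().lower()) for m in re.finditer(r"\w+", text)]
    for i in range(len(tokens) - n + 1):
        gram = " ".join(word for _, word in tokens[i:i + n])
        yield tokens[i][0], hash(gram)

def build_index(sources):
    # Map each n-gram hash to its (source name, offset) occurrences.
    # `sources` is assumed to be a dict of file name -> document text.
    index = defaultdict(list)
    for name, text in sources.items():
        for offset, h in ngrams(text):
            index[h].append((name, offset))
    return index

def candidate_matches(suspicious_text, index):
    # Report each n-gram shared with a source as a candidate match.
    for offset, h in ngrams(suspicious_text):
        for source_name, source_offset in index.get(h, ()):
            yield offset, source_name, source_offset

An actual detector would merge adjacent candidate matches into contiguous passages before reporting them in the output format described below.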

Training Corpus

To develop your approach, we provide you with a training corpus which comprises a set of suspicious documents and a set of source documents. A suspicious document may contain plagiarized passages from one or more source documents.

Learn more » Download corpus

Output

For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an XML file suspicious-documentXYZ.xml containing meta information about all plagiarism cases detected within it:

<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
           source_reference="source-documentABC.txt"
           source_offset="100"
           source_length="1000"
  />
  ...
</document>

The XML documents must be valid with respect to the XML schema found here.
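
As a minimal sketch of producing such a file with Python's standard library (the detections list and file names below are assumed example inputs, not part of the specification):

import xml.etree.ElementTree as ET

def write_detections(suspicious_name, detections, out_path):
    # Serialize detected plagiarism cases into the XML format shown above.
    root = ET.Element("document", reference=suspicious_name)
    for d in detections:  # d: dict with the offsets and lengths shown above
        ET.SubElement(root, "feature",
                      name="detected-plagiarism",
                      this_offset=str(d["this_offset"]),
                      this_length=str(d["this_length"]),
                      source_reference=d["source_reference"],
                      source_offset=str(d["source_offset"]),
                      source_length=str(d["source_length"]))
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

write_detections("suspicious-documentXYZ.txt",
                 [{"this_offset": 5, "this_length": 1000,
                   "source_reference": "source-documentABC.txt",
                   "source_offset": 100, "source_length": 1000}],
                 "suspicious-documentXYZ.xml")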

Performance Measures

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.

Learn more » Download measures
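
For intuition, the plagdet combination can be sketched as the F1 of precision and recall, discounted by the logarithm of the granularity. This is an illustration following the published definition; the reference implementation linked above is authoritative for evaluation.

import math

def plagdet(precision, recall, granularity):
    # plagdet: the F1 of precision and recall, discounted by granularity.
    # Illustrative sketch; use the reference implementation for evaluation.
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log(1 + granularity, 2)

print(plagdet(0.7, 0.6, 1.0))  # granularity 1 leaves the F1 unchanged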

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus does not contain ground truth data revealing whether or not a suspicious document contains plagiarized passages. To find out how your software performs on the test corpus, you must collect its output and submit it as described below.

After the competition, the test corpus is updated to include the ground truth data. This way, you have all the necessary data to evaluate your approach on your own, without submitting its output, yet remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing your software's output on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via email, please upload it to a file hosting service of your choice and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

External plagiarism detection performance

Plagdet   Participant
0.6957    C. Grozea*, C. Gehl*, and M. Popescu°
          *Fraunhofer FIRST, Germany; °University of Bucharest, Romania
0.6093    J. Kasprzak, M. Brandejs, and M. Křipač
          Masaryk University, Czech Republic
0.6041    C. Basile*, D. Benedetto°, E. Caglioti°, G. Cristadoro*, and M. Degli Esposti*
          *Università di Bologna; °Università La Sapienza, Italy
0.3045    Y. A. Palkovskii
          Zhytomyr State University, Ukraine
0.1885    M. Zechner, M. Muhr, R. Kern, and M. Granitzer
          Know-Center Graz, Austria
0.1422    V. Shcherbinin* and S. Butakov°
          *American University of Nigeria, Nigeria; °Solbridge International School of Business, South Korea
0.0649    R. C. Pereira, V. P. Moreira, and R. Galante
          Universidade Federal do Rio Grande do Sul, Brazil
0.0264    E. Vallés Balaguer
          Private, Spain
0.0187    J. A. Malcolm and P. C. R. Lane
          Ferret, University of Hertfordshire, UK
0.0117    J. Allen
          Southern Methodist University in Dallas, USA

A more detailed analysis of the detection performances with respect to precision, recall, and granularity can be found in the overview paper accompanying this task.

Learn more »

Intrinsic Plagiarism Detection

Task

Given a set of suspicious documents, the task is to identify all plagiarized text passages, e.g., by detecting breaches in writing style. Comparing a suspicious document with other documents is not allowed in this task.
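
To sketch the underlying idea (an illustration, not any participant's method): slide a window over the document, build a character trigram profile per window, and flag windows whose profile deviates strongly from the document-wide profile. The window size, step, and threshold below are assumed values.

from collections import Counter

def trigram_profile(text):
    # Relative frequencies of character trigrams in a text.
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def dissimilarity(doc_profile, win_profile):
    # Normalized L1 difference between two trigram profiles (0 to 1).
    grams = set(doc_profile) | set(win_profile)
    return sum(abs(doc_profile.get(g, 0.0) - win_profile.get(g, 0.0))
               for g in grams) / 2

def suspicious_windows(text, window=1000, step=500, threshold=0.4):
    # Yield (this_offset, this_length) for windows deviating from the
    # document-wide style profile. Threshold is an assumed value.
    doc = trigram_profile(text)
    for start in range(0, max(len(text) - window, 1), step):
        win = trigram_profile(text[start:start + window])
        if dissimilarity(doc, win) > threshold:
            yield start, window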

Training Corpus

To develop your approach, we provide you with a training corpus which comprises a set of suspicious documents, each of which may contain plagiarized passages.

Learn more » Download corpus

Output

For each suspicious document suspicious-documentXYZ.txt found in the evaluation corpora, your plagiarism detector shall output an XML file suspicious-documentXYZ.xml containing meta information about all plagiarism cases detected within it:

<document reference="suspicious-documentXYZ.txt">
  <feature name="detected-plagiarism"
           this_offset="5"
           this_length="1000"
  />
  ...
</document>

The XML documents must be valid with respect to the XML schema found here.
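
To check validity programmatically, a small sketch using lxml (the local file names are assumptions; download the schema linked above first):

from lxml import etree

# Validate an output file against the competition's XML schema.
schema = etree.XMLSchema(etree.parse("plagiarism-detection.xsd"))
document = etree.parse("suspicious-documentXYZ.xml")
if not schema.validate(document):
    for error in schema.error_log:
        print(error)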

Performance Measures

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score, which is a combination of the first three measures. For your convenience, we provide a reference implementation of the measures written in Python.

Learn more » Download measures
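
Complementing the plagdet sketch given for the external task, here is an illustrative sketch of the granularity measure: the average number of detections covering each true case that was detected at all. Cases and detections are modeled as (offset, length) pairs; again, the reference implementation is authoritative.

def overlaps(a, b):
    # Two (offset, length) pairs overlap if neither ends before the other starts.
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

def granularity(cases, detections):
    # Average number of detections per detected case; 1.0 is ideal.
    counts = [sum(overlaps(case, d) for d in detections) for case in cases]
    detected = [n for n in counts if n > 0]
    return sum(detected) / len(detected) if detected else 1.0  # 1.0 assumed when nothing is detected

# A single case reported as three fragments yields granularity 3.0.
print(granularity([(0, 300)], [(0, 100), (100, 100), (200, 100)]))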

Test Corpus

Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.

During the competition, the test corpus does not contain ground truth data revealing whether or not a suspicious document contains plagiarized passages. To find out how your software performs on the test corpus, you must collect its output and submit it as described below.

After the competition, the test corpus is updated to include the ground truth data. This way, you have all the necessary data to evaluate your approach on your own, without submitting its output, yet remaining comparable to those who took part in the competition.

Download corpus

Submission

To submit your test run for evaluation, we ask you to send a Zip archive containing your software's output on the test corpus to pan@webis.de.

Should the Zip archive be too large to be sent via email, please upload it to a file hosting service of your choice and share a download link with us.

Results

The following table lists the performances achieved by the participating teams:

Intrinsic plagiarism detection performance

Plagdet   Participant
0.2462    E. Stamatatos
          University of the Aegean, Greece
0.1955    B. Hagbi and M. Koppel
          Bar Ilan University, Israel
0.1766    M. Zechner, M. Muhr, R. Kern, and M. Granitzer
          Know-Center Graz, Austria
0.1219    L. M. Seaward and S. Matwin
          University of Ottawa, Canada

A more detailed analysis of the detection performances can be found in the overview paper accompanying this task.

Learn more »

Task Chair

Martin Potthast
Bauhaus-Universität Weimar

Task Committee

Benno Stein
Bauhaus-Universität Weimar

Andreas Eiselt
Bauhaus-Universität Weimar

Alberto Barrón-Cedeño
Universitat Politècnica de València

Paolo Rosso
Universitat Politècnica de València