Plagiarism Detection in Arabic Text (AraPlagDet)
2015

This task has two sub-tasks: External Plagiarism Detection and Intrinsic Plagiarism Detection. You can choose to solve one or both of them.

External Plagiarism Detection

Task

Identify the similar text fragments between the suspicious document and the potential sources of plagiarism. The suspicious document should be compared to other documents.

Training Corpus

You will be provided with a training corpus to use while developing your method. The corpus consists of two collections: suspicious documents and source documents. Each suspicious document is associated with an XML document that specifies the position of each plagiarised fragment and its source, which allows you to check the correctness of your detections.

Download corpus

Output

Participants should submit their runs on the test corpus in the form of XML documents following the annotation format of the actual plagiarism provided in the training corpus (PAN format). Below is an example of the content of the XML documents that participants should submit.

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5" 
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>

Each <feature> tag describes a detected fragment. The attributes this_offset and source_offset give the position of the first character of the detected plagiarism fragment in the suspicious document and the source document respectively. The attributes this_length and source_length give the number of characters of the detected plagiarism fragment in the suspicious document and the source document respectively.

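
As an illustration of the format described above, here is a minimal Python sketch that builds the XML for one suspicious document. The function name build_detection_xml and the dictionary layout of a detection are our own assumptions for this example and are not part of the PAN format; only the element and attribute names come from the example above.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(suspicious_name, detections):
    """Return the XML text for one suspicious document.

    `detections` is a list of dicts (a hypothetical layout chosen for this
    sketch) with the keys this_offset, this_length, source_reference,
    source_offset and source_length.
    """
    root = ET.Element("document", reference=suspicious_name)
    for det in detections:
        ET.SubElement(
            root,
            "feature",
            name="detected-plagiarism",
            this_offset=str(det["this_offset"]),
            this_length=str(det["this_length"]),
            source_reference=det["source_reference"],
            source_offset=str(det["source_offset"]),
            source_length=str(det["source_length"]),
        )
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


example = build_detection_xml(
    "suspicious-documentXYZ.txt",
    [
        {
            "this_offset": 5,
            "this_length": 1000,
            "source_reference": "source-documentABC.txt",
            "source_offset": 100,
            "source_length": 1000,
        }
    ],
)
print(example)
```

The returned string can be written to a file that has the same name as the suspicious document but with an .xml extension.
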
Performance Measures

The following measures are used to evaluate the performance of methods:

  • Precision and Recall at character level,
  • Granularity: it reports overlapping or multiple detections for a single plagiarism case,
  • Plagdet: a combination of the former measures to an overall score. This score is used to rank the methods.
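
To make the combination concrete, here is a short Python sketch of how Plagdet can be derived from the three other measures. It assumes the definition used in the PAN evaluation framework, as we understand it: the F1 score of precision and recall divided by log2(1 + granularity). The function names are ours; the official scores are those produced by the PAN code.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """F1 penalised by granularity; assumes granularity >= 1."""
    return f1_score(precision, recall) / math.log2(1 + granularity)


# Values of the Magooda_2 run as reported in the Results table.
score = plagdet(0.8521183, 0.8314955, 1.0694534)
print(round(score, 3))
```

With a granularity of exactly 1 (each plagiarism case detected as a single fragment), the denominator is log2(2) = 1 and Plagdet equals F1. For the Magooda_2 values the sketch gives approximately 0.802, in line with the Plagdet reported for that run.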

Learn more »

To compute these measures, we use the same code provided at PAN@CLEF (author: Martin Potthast).

Download measures' code »

Test Corpus

This corpus will be provided just a couple of weeks before the end of the competition. You have to run your method on this corpus in order to compare it with others. The XML documents (the actual plagiarism annotation) will not be provided in this corpus until the end of the competition. 

Update: plagiarism annotations are now available for download.

Download corpus

Submission

(1) Run your method(s) on the test corpus, and generate for each suspicious document an XML document (with the same name) that contains the detected plagiarism (please follow the format shown above).
Please record the runtime of your method, in seconds, on both the training corpus and the test corpus.

(2) Send your runs on the test corpus and on the training corpus, in separate zip files, to araplagdet.fire2015@gmail.com. (You can also upload them to a service such as Google Drive or Dropbox and send us the links.)
The names of the zip files should follow the format below:
- for the runs on the training corpus: external_train_your.familly.name.zip
- for the runs on the test corpus: external_test_your.familly.name.zip

If you submit more than one run for a subtask, please add a number to the names of your files, e.g. external_test_your.familly.name_1.zip and external_test_your.familly.name_2.zip

In the email, list the names of the zip files you sent along with the runtime needed to generate each of them (preferably in seconds, but an approximate time in hours or minutes is acceptable), e.g.
external_test_your.familly.name.zip 1000 sec
external_train_your.familly.name.zip 1022 sec
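
The packaging step can be scripted. The following Python sketch zips the XML files of one run using the naming scheme above and formats the matching line for the email. The helper names (zip_name, package_run, email_line) are hypothetical, and placing the XML files at the top level of the archive is our assumption, since the instructions do not prescribe a layout.

```python
import zipfile
from pathlib import Path


def zip_name(subtask, corpus, family_name, run_number=None):
    """Build a file name such as external_test_your.familly.name_1.zip."""
    suffix = f"_{run_number}" if run_number is not None else ""
    return f"{subtask}_{corpus}_{family_name}{suffix}.zip"


def package_run(xml_dir, out_dir, subtask, corpus, family_name, run_number=None):
    """Zip every .xml file found directly in xml_dir and return the zip path."""
    zip_path = Path(out_dir) / zip_name(subtask, corpus, family_name, run_number)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for xml_file in sorted(Path(xml_dir).glob("*.xml")):
            # Assumption: files are stored flat, without their directory.
            archive.write(xml_file, arcname=xml_file.name)
    return zip_path


def email_line(zip_path, runtime_seconds):
    """Format one line of the submission email, e.g. 'name.zip 1000 sec'."""
    return f"{Path(zip_path).name} {round(runtime_seconds)} sec"
```

The runtime itself can be measured by reading time.perf_counter() before and after running your method and passing the difference to email_line.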

Results

The winner is the Magooda team. Congratulations!

Rank  Method        Plagdet    Precision  Recall     Granularity
1     Magooda_2     0.8021737  0.8521183  0.8314955  1.0694534
2     Magooda_3     0.7715430  0.8535960  0.7591500  1.0584507
3     Magooda_1     0.7669238  0.8048888  0.7863135  1.0523139
4     Palkovskii_1  0.6270522  0.9774681  0.5422843  1.1621349
#     Baseline      0.6075470  0.9903910  0.5349007  1.2089249
5     Alzahrani     0.5739858  0.8308816  0.5304589  1.1857283
6     Palkovskii_3  0.5595334  0.6582152  0.5893837  1.1606464
7     Palkovskii_2  0.5178527  0.5642698  0.5893964  1.1634981

Learn more »

Participants

  • Magooda: Ahmed Ezzat abdelGawad Magooda, Ahsraf Youssef Mahgoub and Mohsen Rashwan (RDI, Egypt)
  • Palkovskii: Yurii Palkovskii and Alexei Belov (SkyLine LLC and Plagiarism Detector Project, Ukraine)
  • Alzahrani: Salha Alzahrani (Taif University, Saudi Arabia)

Related Work

The external plagiarism detection task (on English documents) was run from PAN'09 to PAN'11.

Since PAN'12, the external plagiarism detection task has been divided into the Source Retrieval and Text Alignment sub-tasks.

Intrinsic Plagiarism Detection

Task

Identify the text fragments that are inconsistent with the rest of the document in terms of the writing style. The suspicious document should not be compared to any other documents.

Training Corpus

You will be provided with a training corpus to use while developing your method. The corpus consists of a collection of suspicious documents. Each suspicious document is associated with an XML document that specifies the position of each plagiarised fragment, which allows you to check the correctness of your detections.

Download corpus

Output

Participants should submit their runs on the test corpus in the form of XML documents following the annotation format of the actual plagiarism provided in the training corpus (PAN format). Below is an example of the content of the XML documents that participants should submit.

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5" 
  this_length="1000"
/>
<feature ... />
...
</document>

Each <feature> tag describes a detected fragment. The attribute this_offset gives the position of the first character of the detected plagiarism fragment in the suspicious document. The attribute this_length gives the number of characters of the detected plagiarism fragment in the suspicious document.

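
To check detections against the text, it can help to read such an XML document back and cut out the fragments it points to. The Python sketch below does this. It assumes that offsets are 0-based and counted in characters rather than bytes, which you should verify against the training annotations; the function names read_fragments and extract are ours.

```python
import xml.etree.ElementTree as ET


def read_fragments(xml_text):
    """Return (offset, length) pairs for every <feature> that has an offset."""
    root = ET.fromstring(xml_text)
    return [
        (int(feature.get("this_offset")), int(feature.get("this_length")))
        for feature in root.iter("feature")
        if feature.get("this_offset") is not None
    ]


def extract(text, fragments):
    """Cut the fragments out of the document text.

    Assumption: offsets are 0-based character positions. Python strings are
    indexed by Unicode code point, so the document must be read as text
    (for example with encoding="utf-8"), not as bytes.
    """
    return [text[offset:offset + length] for offset, length in fragments]


sample_xml = (
    '<document reference="suspicious-documentXYZ.txt">'
    '<feature name="detected-plagiarism" this_offset="5" this_length="4"/>'
    "</document>"
)
print(extract("01234abcd9", read_fragments(sample_xml)))
```
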
Performance Measures

The following measures are used to evaluate the performance of methods:

  • Precision and Recall at character level,
  • Granularity: it reports overlapping or multiple detections for a single plagiarism case,
  • Plagdet: a combination of the former measures to an overall score. This score is used to rank the methods.
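
For the intrinsic case, where every fragment is an (offset, length) pair in a single document, the first two bullets can be sketched as follows. This is a simplified Python illustration modelled on the macro-averaged PAN definitions as we understand them. It is not the official implementation, and the function names are ours.

```python
def overlap(a, b):
    """Number of characters shared by two (offset, length) intervals."""
    start = max(a[0], b[0])
    end = min(a[0] + a[1], b[0] + b[1])
    return max(0, end - start)


def covered_chars(target, others):
    """Number of characters of `target` covered by the union of `others`."""
    start, length = target
    covered = set()
    for other in others:
        lo = max(start, other[0])
        hi = min(start + length, other[0] + other[1])
        covered.update(range(lo, hi))  # empty range when there is no overlap
    return len(covered)


def evaluate(cases, detections):
    """Return (precision, recall, granularity); intervals must have length > 0."""
    precision = (
        sum(covered_chars(r, cases) / r[1] for r in detections) / len(detections)
        if detections else 0.0
    )
    recall = (
        sum(covered_chars(s, detections) / s[1] for s in cases) / len(cases)
        if cases else 0.0
    )
    # Granularity: average number of detections per detected case.
    hits = [sum(1 for r in detections if overlap(s, r) > 0) for s in cases]
    detected = [h for h in hits if h > 0]
    granularity = sum(detected) / len(detected) if detected else 1.0
    return precision, recall, granularity


# One annotated case of 100 characters, reported as two pieces plus a false alarm.
print(evaluate([(0, 100)], [(0, 40), (60, 40), (200, 50)]))
```

In this example two of the three detections lie entirely inside the annotated case, 80 of its 100 characters are found, and the case is reported in two pieces, so the granularity is 2.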

Learn more »

To compute these measures, we use the same code provided at PAN@CLEF (author: Martin Potthast).

Download measures' code »

Test Corpus

This corpus will be provided just a couple of weeks before the end of the competition. You have to run your method on this corpus in order to compare it with others. The XML documents (the actual plagiarism annotation) will not be provided in this corpus until the end of the competition. 

Update: plagiarism annotations are now available for download.

Download corpus

Submission

(1) Run your method(s) on the test corpus, and generate for each suspicious document an XML document (with the same name) that contains the detected plagiarism (please follow the format shown above).
Please record the runtime of your method, in seconds, on both the training corpus and the test corpus.

(2) Send your runs on the test corpus and on the training corpus, in separate zip files, to araplagdet.fire2015@gmail.com. (You can also upload them to a service such as Google Drive or Dropbox and send us the links.)
The names of the zip files should follow the format below:
- for the runs on the training corpus: intrinsic_train_your.familly.name.zip
- for the runs on the test corpus: intrinsic_test_your.familly.name.zip

If you submit more than one run for a subtask, please add a number to the names of your files, e.g. intrinsic_test_myname_1.zip and intrinsic_test_myname_2.zip

In the email, list the names of the zip files you sent along with the runtime needed to generate each of them (preferably in seconds, but an approximate time in hours or minutes is acceptable), e.g.
intrinsic_test_yourname.zip 1000 sec
intrinsic_train_yourname.zip 1022 sec

Results

Only one participant!

Method    Plagdet    Precision  Recall     Granularity
Baseline  0.3753558  0.2691282  0.7792149  1.0934164
Magooda   0.1926474  0.1879309  0.1976069  1.0000000

Learn more »

Participants

  • Magooda: Ahmed Ezzat abdelGawad Magooda, Ahsraf Youssef Mahgoub and Mohsen Rashwan (RDI, Egypt)

Related Work

The intrinsic plagiarism detection task on English documents was run from PAN'09 to PAN'12.

Task Coordinators

Imene Bensalem

Constantine 2 University, Algeria

Paolo Rosso

Universitat Politècnica de València, Spain

Kareem Darwish

Qatar Computing Research Institute, Qatar

Salim Chikhi

Constantine 2 University, Algeria

Imène Boukhalfa

Constantine 2 University, Algeria