Detection of SOurce COde Re-use 2015

Synopsis

Task: Given a source code, is it an original? This task focuses on the detection of cross-language source code re-use: finding pairs of source codes, written in different programming languages, that are likely to be re-use cases.
Training Data [corpus] [relevance-judgements]
Test Data [corpus]
Evaluation [script] [example-output]

Introduction

Information has become easily accessible with the advent of the Web. Blogs, forums, repositories, etc. have made source code widely available to be read, copied and modified. Programmers are tempted to re-use debugged and tested source code that can easily be found on the Web. The vast amount of resources on the Web makes manual analysis of suspected source code re-use unfeasible, so there is a need to develop automatic systems for detecting source code re-use.

Software companies have a special interest in preserving their own intellectual property. In a survey of 3,970 developers, more than 75 percent of respondents admitted to having re-used blocks of source code from elsewhere. Academic and programming environments have also become a potential scenario for research in source code re-use because it is a frequent practice among students. In this context, students are tempted to re-use source code because they all have to solve the same problem. It is a difficult scenario for detecting source code re-use because all the source codes will contain some degree of similarity.

The second edition of the SOCO task focuses on cross-language source code re-use detection. Participants will be provided with cross-language training and test sets of source code files. The task is about retrieving the source code pairs that have been re-used across programming languages.

Task

In this second edition, SOCO focuses on cross-language source code re-use detection. Source codes are tagged by language to ease the detection. You are provided with a set of source codes written in C and Java. The task is to retrieve the source code pairs that have been re-used across programming languages. This is a document-level task, so no specific fragments inside the source codes are expected to be identified, only pairs of source codes. Participants are asked to determine whether a source code has been plagiarised across programming languages. In the test phase, the only annotation provided is the programming language extension. IMPORTANT: a maximum of 3 RUNS per team is allowed.
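To make the document-level nature of the task concrete, here is a minimal baseline sketch. The task does not prescribe any detection method; character 3-gram cosine similarity is just one generic option chosen for illustration, and the function names, the whitespace normalisation and the 0.5 threshold are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and removing whitespace."""
    compact = "".join(text.lower().split())
    return Counter(compact[i:i + n] for i in range(len(compact) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_pairs(c_sources, java_sources, threshold=0.5):
    """Return (C name, Java name, score) for every cross-language pair
    whose similarity reaches the (hypothetical) threshold."""
    java_profiles = {name: char_ngrams(text) for name, text in java_sources.items()}
    results = []
    for c_name, c_text in c_sources.items():
        c_profile = char_ngrams(c_text)
        for j_name, j_profile in java_profiles.items():
            score = cosine(c_profile, j_profile)
            if score >= threshold:
                results.append((c_name, j_name, score))
    return results
```

Only whole files are compared, and only C files against Java files, which mirrors the pair-level output the task asks for. A threshold tuned on the training data would normally replace the placeholder value.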

Data

For the training phase we provide an annotated corpus that includes the programming language extensions. It includes information about whether a source code has been re-used and, if so, what its source is.

The collection consists of source codes written in Java and C.
The Java and C collections contain 599 source codes numbered from 001 to 599.
Re-use is committed across programming languages.
Re-used cases are those with the same file number (e.g. 011.c ↔ 011.java).
Relevance judgements represent re-use in both directions (a→b and b→a).
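The numbering convention in the list above can be sketched in a few lines of Python. This is only an illustration of how same-numbered files pair up and how a pair expands into both directions; the helper names are hypothetical, and the provided relevance judgements remain the authoritative ground truth.

```python
from pathlib import PurePath


def paired_by_number(file_names):
    """Return (C file, Java file) pairs whose names share the same number."""
    c_files = {}
    java_files = {}
    for name in file_names:
        path = PurePath(name)
        if path.suffix == ".c":
            c_files[path.stem] = name
        elif path.suffix == ".java":
            java_files[path.stem] = name
    shared = sorted(set(c_files) & set(java_files))
    return [(c_files[number], java_files[number]) for number in shared]


def both_directions(pairs):
    """Expand each pair (a, b) into the two directed cases a→b and b→a."""
    expanded = set()
    for a, b in pairs:
        expanded.add((a, b))
        expanded.add((b, a))
    return expanded


names = ["011.c", "011.java", "012.c", "013.java"]
pairs = paired_by_number(names)
print(pairs)                           # [('011.c', '011.java')]
print(sorted(both_directions(pairs)))  # [('011.c', '011.java'), ('011.java', '011.c')]
```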

In the test phase the only annotation that will be provided in the corpus is the programming language extension.

The corpus is divided by programming language (Java and C), so no pre-processing is needed to identify the programming language of the source codes.
The Java and C collections contain 79 source codes numbered from 000 to 078.
Cross-language source code re-use is committed only from C to Java.

Output

The results of your re-use detection software are required to be formatted in XML:

<document>
<!-- source_codeC: file name of the first source code re-used element -->
<!-- source_codeJ: file name of the second source code re-used element -->
<reuse_case
  source_codeC="..."
  source_codeJ="..."
/>
<reuse_case
  source_codeC="..."
  source_codeJ="..."
/>
...                    <!-- more detections in the collection -->
</document>

For each suspicious source code pair there must be one <reuse_case .../> entry in the XML file.
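As a rough sketch, the output format described above could be produced with Python's standard xml.etree.ElementTree module. The function name build_output is hypothetical, and whether the attribute values should carry the file extension is an assumption; check the example output linked from the Synopsis before submitting.

```python
import xml.etree.ElementTree as ET


def build_output(pairs):
    """Serialise detected (C file, Java file) pairs as one <reuse_case/> per pair."""
    root = ET.Element("document")
    for c_name, java_name in pairs:
        ET.SubElement(
            root,
            "reuse_case",
            {"source_codeC": c_name, "source_codeJ": java_name},
        )
    return ET.tostring(root, encoding="unicode")


# Assumption: file names are written with their extensions.
xml_text = build_output([("011.c", "011.java"), ("020.c", "020.java")])
print(xml_text)
```

Building the tree with a library rather than by string concatenation keeps attribute values correctly escaped.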

Evaluation

The success of the detection of source code re-use will be measured in terms of Precision, Recall and F1 measure.
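For reference, the standard definitions of these three measures over sets of detected and relevant pairs can be sketched as follows. This is not the official evaluation script, whose exact handling of directions and edge cases may differ; the function name and the zero-division conventions are assumptions.

```python
def precision_recall_f1(detected, relevant):
    """Compute precision, recall and F1 over two collections of pairs."""
    detected = set(detected)
    relevant = set(relevant)
    true_positives = len(detected & relevant)
    # Assumption: an empty set yields a score of 0.0 rather than an error.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


detected = {("011.c", "011.java"), ("012.c", "040.java")}
relevant = {("011.c", "011.java"), ("012.c", "012.java"), ("013.c", "013.java")}
p, r, f = precision_recall_f1(detected, relevant)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.333 0.4
```

In the example, one of the two detected pairs is correct (precision 0.5) and one of the three relevant pairs is found (recall 1/3), giving an F1 of 0.4.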

Run the evaluation script as python clsoco15-eval.py

Submission

Participants can use the training data to build their systems, tune parameters, develop proofs of concept, perform manual analysis, etc. Although participants can analyse their runs on the test data, they should not tune their systems on it: doing so is both unreliable and unethical in a shared task environment.

Participants are allowed to submit up to three runs per language pair in order to experiment with different settings.

The results of your detection are required to be formatted as explained in the Output section.

Similarities/Differences with SOCO-2014

For your convenience, we summarize here the main similarities/differences of the source code re-use detection task @PAN-FIRE-2015 with respect to the corresponding task @PAN-FIRE-2014:

Similarities:

The task definition is essentially the same
The format of corpus and ground truth is the same
The format of input/output of your software is ALMOST the same
The evaluation measures are the same

Differences:

The detection of source code re-use is established between C and Java programming languages.

Related Work

PAN @ FIRE'14 (overview paper)
Arwin, C., Tahaghoghi, S. (2006). Plagiarism detection across programming languages. In Proceedings of the 29th Australasian Computer Science Conference, 48, pp. 277-286. Australian Computer Society, Inc.
Flores E., Barrón-Cedeño A., Moreno L., Rosso P. (2014). Cross-Language Source Code Re-Use Detection. In: Proc. 3rd Spanish Conf. on Information Retrieval, pp. 145-156.
Flores E., Barrón-Cedeño A., Rosso P., Moreno L. (2012). DeSoCoRe: Detecting Source Code Re-Use across Programming Languages. In: Proc. 12th Int. Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-2012, pp. 1-4.
Flores E., Rosso P., Moreno L., Villatoro-Tello E. (2014). PAN@FIRE: Overview of SOCO Track on the Detection of SOurce COde Re-use. In: Proceedings of the Sixth Forum for Information Retrieval Evaluation (FIRE 2014).
Flores E., Barrón-Cedeño A., Moreno L., Rosso P. (2015). Uncovering Source Code Re-Use in Large-Scale Programming Environments. In: Computer Applications in Engineering and Education, 23(3), pp. 383-390. DOI: 10.1002/cae.21608