Detection of SOurce COde Re-use 2014

Synopsis

Task: Given a source code, is it an original? This task focuses on the detection of monolingual source code re-use. It is about searching for pairs of source codes, written in the same programming language, that are likely to be re-use cases.
Test and Training Data (with relevance judgements) [data]
Evaluation [script]

Introduction

Information has become easily accessible with the advent of the Web. Blogs, forums, repositories, etc. have made source code widely available to be read, copied and modified. Programmers are tempted to re-use debugged and tested source code that can easily be found on the Web. The vast amount of resources on the Web makes manual analysis of suspected source code re-use infeasible. There is a need to develop automatic systems for detecting source code re-use.

Software companies have a special interest in preserving their own intellectual property. In a survey of 3,970 developers, more than 75 percent of respondents admitted that they have re-used blocks of source code from elsewhere. Academic and programming environments have also become a relevant scenario for research on source code re-use because it is a frequent practice among students. In this context, students are tempted to re-use source code because they all have to solve the same problem. This makes detection difficult, because all the source codes will share some degree of similarity.

SOCO is a new task focused on source code re-use. We offer a task based on the detection of source codes that have been re-used in a monolingual context, such as an academic environment. The task involves identifying and distinguishing the most similar source code pairs in a large source code collection.

Task

In this first edition, SOCO focuses on monolingual source code re-use detection. Nevertheless, the collection contains source codes in different languages, which are tagged by language to ease detection. You are provided with a set of source codes written in C and Java. The task is to retrieve the source code pairs that have been re-used. Note that this is a document-level task: no specific fragments inside the source codes are expected to be identified, only pairs of source codes. Participants are asked to determine whether a source code has been re-used. In the test phase the only annotation provided is the programming language extension. IMPORTANT: a maximum of 3 RUNS per team is allowed.
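As an illustration of what retrieving re-used pairs at document level can look like, the following is a minimal baseline sketch. It is not prescribed by the task: the technique (Jaccard similarity over character trigrams), the function names `char_ngrams`, `jaccard` and `candidate_pairs`, and the threshold of 0.8 are all assumptions chosen for the example.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, ignoring whitespace and case."""
    compact = "".join(text.split()).lower()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(sources, threshold=0.8):
    """Yield (name1, name2, score) for pairs whose similarity reaches threshold.

    sources maps a file name to its source text. All files are assumed to
    be written in the same programming language (monolingual setting).
    The threshold is a hypothetical value, not one given by the task.
    """
    profiles = {name: char_ngrams(text) for name, text in sources.items()}
    for first, second in combinations(sorted(profiles), 2):
        score = jaccard(profiles[first], profiles[second])
        if score >= threshold:
            yield first, second, score
```

Each pair is compared once, since the task asks only for pairs of source codes and not for a direction of re-use. A real system would likely need a similarity that is robust to renamed identifiers and reordered code, which this sketch does not attempt.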

Data

For the training phase we provide an annotated corpus that includes the programming language extensions. It includes information about whether a source code has been re-used and, if so, what its source is.

The collection consists of source codes written in Java and C.
Re-use is committed in both programming languages but ONLY at monolingual level.
The Java collection contains 259 source codes from 000.java to 258.java.
The C collection contains 79 source codes from 000.c to 078.c.
Relevance judgements represent re-use in both directions (a→b and b→a).

In the test phase the only annotation that will be provided in the corpus is the programming language extensions.

It is divided by programming language (C/C++ and Java), so no pre-processing is needed to identify the programming language of the source codes.
Each programming language folder contains 6 folders (A1, B1, B2, C1 and C2), each of which contains a specific scenario with monolingual re-use.
There is no re-use between scenarios, so you only need to look for re-use cases among the source code files inside each folder.
The name of each file consists of the name of the scenario to which it belongs plus an identifier. For example, file "B10021" belongs to scenario B1 and its identifier number is 0021.
Re-use cannot exist between source codes that belong to different scenarios. For example, you do not have to submit a re-use case between files "B10021" and "B20013": the first belongs to scenario B1 and the second to B2.
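The naming rule above can be used to restrict comparisons to files of the same scenario. The following is a minimal sketch, assuming that the scenario label is always the first two characters of the file name and that any extension follows a dot. The helper names `split_name` and `pairs_within_scenarios` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_name(file_name):
    """Split a test file name such as "B10021" into ("B1", "0021").

    Assumes a two-character scenario label; an optional extension
    (for example ".java") is dropped before splitting.
    """
    stem = file_name.rsplit(".", 1)[0]
    return stem[:2], stem[2:]


def pairs_within_scenarios(file_names):
    """Yield every unordered pair of files that share a scenario."""
    by_scenario = defaultdict(list)
    for name in file_names:
        scenario, _ = split_name(name)
        by_scenario[scenario].append(name)
    for scenario in sorted(by_scenario):
        yield from combinations(sorted(by_scenario[scenario]), 2)
```

With the example from the text, "B10021" and "B20013" fall into different groups and are therefore never compared.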

Output

The results of your re-use detection software are required to be formatted in XML:

<document>
<!-- source_code1: file name of the first re-used source code -->
<!-- source_code2: file name of the second re-used source code -->
<reuse_case
  source_code1="..."
  source_code2="..."
/>
<reuse_case
  source_code1="..."
  source_code2="..."
/>
...                    <!-- more detections in the collection -->
</document>

For each suspicious source code pair there will be one <reuse_case .../> entry in the XML file.
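A file in this format can be produced with a standard XML library. The following is a minimal sketch using Python's `xml.etree.ElementTree`. The function names `build_reuse_document` and `write_reuse_document` are hypothetical, and whether the evaluation script expects an XML declaration at the top of the file is an assumption that should be checked against the script.

```python
import xml.etree.ElementTree as ET


def build_reuse_document(pairs):
    """Build a <document> element with one <reuse_case/> per detected pair.

    pairs is an iterable of (file_name_1, file_name_2) tuples.
    """
    root = ET.Element("document")
    for first, second in pairs:
        ET.SubElement(root, "reuse_case", source_code1=first, source_code2=second)
    return root


def write_reuse_document(pairs, path):
    """Write the detections to path as UTF-8 XML (declaration included)."""
    tree = ET.ElementTree(build_reuse_document(pairs))
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

For example, `write_reuse_document([("000.java", "001.java")], "run1.xml")` would write a document containing a single re-use case; the output file name is only illustrative.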

Evaluation

The success of the detection of source code re-use will be measured in terms of Precision, Recall and F1 measure.

Run the evaluation script as python soco14-update.py
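For reference, the three measures can be computed over sets of detected and relevant pairs as in the sketch below. This is not the official evaluation script. It assumes that a pair counts once regardless of direction (the relevance judgements list both a→b and b→a), and the function names `normalise` and `precision_recall_f1` are hypothetical.

```python
def normalise(pairs):
    """Turn (a, b) tuples into a set of unordered pairs."""
    return {frozenset(pair) for pair in pairs}


def precision_recall_f1(detected, relevant):
    """Return (precision, recall, F1) for detected pairs against relevant pairs."""
    detected_set = normalise(detected)
    relevant_set = normalise(relevant)
    true_positives = len(detected_set & relevant_set)
    precision = true_positives / len(detected_set) if detected_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision is the fraction of submitted pairs that are true re-use cases, recall is the fraction of true re-use cases that were submitted, and F1 is their harmonic mean.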

Submission

Participants can use the training data to build their systems, tune parameters, develop proofs of concept, perform manual analysis, etc. Although participants can analyse their runs on the test data, they should not tune their systems on it. Doing so is both unreliable and unethical in a shared task environment.

Participants are allowed to submit up to three runs per language pair in order to experiment with different settings.

The results of your detection must be formatted as explained in the Output section.

Related Work

Chae D., Ha J., Kim S., Kang B., Im E. (2013). Software plagiarism detection: a graph-based approach. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 1577-1580. ACM.
Marinescu D., Baicoianu A., Dimitriu S. (2012). Software for Plagiarism Detection in Computer Source Code. In Proc. of the 7th International Conference on Virtual Learning ICVL 2012, pp. 373-379.
Flores E., Barrón-Cedeño A., Rosso P., Moreno L. (2011). Towards the Detection of Cross-Language Source Code Reuse. In: Proc. 16th Int. Conf. on Applications of Natural Language to Information Systems, NLDB-2011, Springer-Verlag, LNCS(6716), pp. 250-253.
Flores E., Barrón-Cedeño A., Rosso P., Moreno L. (2012). DeSoCoRe: Detecting Source Code Re-Use across Programming Languages. In: Proc. 12th Int. Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-2012, pp. 1-4.
Flores E., Barrón-Cedeño A., Moreno L., Rosso P. (2014). Uncovering Source Code Re-Use in Large-Scale Programming Environments. In: Computer Applications in Engineering and Education. DOI: 10.1002/cae.21608