Information has become easily accessible with the advent of the Web. Blogs, forums, repositories, etc. have made source code widely available to be read, copied and modified. Programmers are tempted to re-use debugged and tested source code that can easily be found on the Web. The vast amount of resources on the Web makes manual analysis of suspected source code re-use infeasible. There is a need to develop automatic systems for detecting source code re-use.
Software companies have a special interest in preserving their own intellectual property. In a survey of 3,970 developers, more than 75 percent of respondents admitted that they had re-used blocks of source code from elsewhere. Academic and programming environments have become a potential scenario for research in source code re-use because it is a frequent practice among students. In this context, students are tempted to re-use source code because they all have to solve the same problem. It is a difficult scenario for detecting source code re-use because all the source codes will contain some degree of similarity.
SOCO is a new task focused on source code re-use. We will offer a task based around the detection of source codes that have been re-used in a monolingual context such as an academic environment. The task will involve identifying and distinguishing the most similar source code pairs among a large source code collection.
In this first edition SOCO focuses on monolingual source code re-use detection. Nevertheless, there will be source codes in different languages. Source codes will be tagged by language to ease the detection. You are provided with a set of source codes written in C and Java. The task is to retrieve the source code pairs that have been re-used. Note that this is a document-level task. No specific fragments inside the source codes are expected to be identified; only pairs of source codes. Participants are asked to determine whether a source code has been re-used.
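To make the shape of the task concrete, the following is a minimal baseline sketch, not the organizers' method. It assumes a hypothetical `sources` mapping from file name to a (language, code text) tuple, compares only files tagged with the same language, and scores each pair with the Jaccard overlap of token trigrams. The tokenizer, the trigram size and the 0.5 threshold are illustrative assumptions rather than part of the task definition.

```python
import re
from itertools import combinations


def token_ngrams(code, n=3):
    """Set of n-grams over a crude token stream (identifiers, numbers, symbols)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(sources, threshold=0.5, n=3):
    """Return (name1, name2, score) for same-language pairs at or above threshold.

    `sources` (hypothetical layout) maps file name -> (language, code text).
    """
    profiles = {name: (lang, token_ngrams(code, n))
                for name, (lang, code) in sources.items()}
    results = []
    for name1, name2 in combinations(sorted(profiles), 2):
        lang1, grams1 = profiles[name1]
        lang2, grams2 = profiles[name2]
        if lang1 != lang2:
            continue  # monolingual task: never compare C against Java
        score = jaccard(grams1, grams2)
        if score >= threshold:
            results.append((name1, name2, score))
    # Most similar pairs first.
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Because every submission solves the same problem, a fixed threshold like this is likely to produce false positives; it is only meant to show the document-level, pairwise nature of the expected output.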
In the test phase the only annotation that will be provided is the programming language extensions.
IMPORTANT: a maximum of 3 RUNS per team is allowed.
For the training phase we provide an annotated corpus, including the programming language extensions. It includes information about whether a text fragment has been re-used and, if so, what its source is.
Training corpus
md5sum a0588d47c8c3dcd476f37caef0eb8606 , 2.52 kB
Java Training Relevance Judgements
C Training Relevance Judgements
The results of your re-use detection software are required to be formatted in XML:
<document>
  <!-- source_code1: file name of the first source code re-used element -->
  <!-- source_code2: file name of the second source code re-used element -->
  <reuse_case source_code1="..." source_code2="..." />
  <reuse_case source_code1="..." source_code2="..." />
  <!-- ... more detections in the collection ... -->
</document>
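As an illustration of producing this format, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The function name `build_report` and the example file names are hypothetical; only the `document` and `reuse_case` elements and the `source_code1` and `source_code2` attributes come from the format above.

```python
import xml.etree.ElementTree as ET


def build_report(pairs):
    """Return the detection report as an XML string.

    `pairs` is an iterable of (file_name_1, file_name_2) tuples.
    """
    root = ET.Element("document")
    for first, second in pairs:
        # One <reuse_case/> element per detected pair.
        ET.SubElement(root, "reuse_case",
                      source_code1=first, source_code2=second)
    return ET.tostring(root, encoding="unicode")


# Hypothetical file names, for illustration only.
print(build_report([("001.java", "014.java"), ("007.c", "023.c")]))
```

Building the report through an XML library rather than string concatenation ensures that file names containing characters such as `&` or `"` are escaped correctly.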
The success of the detection of source code re-use will be measured in terms of Precision, Recall and F1 measure.
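The three measures can be sketched as follows. This is not the official scorer: the function name `evaluate` is hypothetical, and the sketch assumes that a pair is unordered, so that (a, b) and (b, a) count as the same detection. `relevant` stands for the pairs listed in the relevance judgements.

```python
def evaluate(detected, relevant):
    """Precision, recall and F1 over unordered file-name pairs."""
    # Assumption: frozenset makes (a, b) and (b, a) the same pair.
    detected = {frozenset(p) for p in detected}
    relevant = {frozenset(p) for p in relevant}
    true_positives = len(detected & relevant)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, a run that reports two pairs, one of which is among three relevant pairs, gets a precision of 1/2 and a recall of 1/3.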
We refer you to:
Lidia Moreno
Universitat Politècnica de València, Spain