Research group

Machine Learning Methods in Software Engineering

GitHub License Violations Study

Project supervisor: Timofey Bryksin
Status: Active

This project conducts a comprehensive plagiarism analysis of Java code fragments on GitHub. It consists of three parts: gathering a large (1.5 TB) corpus of Java repositories, searching it for clones (using an approach proposed in our other project), and the analysis itself, which studies plagiarism and license violations in the resulting data. The discovered licenses and the relationships between them are studied in detail, and similar code fragments are ranked by how likely they are to constitute a license violation.
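
The final ranking step can be illustrated with a minimal sketch. This is not the project's actual method: the restrictiveness levels, the weights, the use of the earliest observed date as a proxy for the origin of a fragment, and all names (Fragment, ClonePair, violation_weight, rank_pairs) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical restrictiveness levels (assumption, not legal advice):
# a higher number means a more restrictive license.
RESTRICTIVENESS = {"MIT": 0, "Apache-2.0": 0, "LGPL-3.0": 1, "GPL-3.0": 2}


@dataclass
class Fragment:
    repo: str
    license: Optional[str]  # None means no license was detected
    first_seen: date        # earliest date the fragment was observed


@dataclass
class ClonePair:
    a: Fragment
    b: Fragment
    similarity: float  # clone-detector similarity in [0, 1]


def violation_weight(origin: Fragment, copy: Fragment) -> float:
    """Illustrative weight for how suspicious a copy direction is."""
    o = RESTRICTIVENESS.get(origin.license)
    c = RESTRICTIVENESS.get(copy.license)
    if o is None or c is None:
        return 0.5  # unknown or missing license: uncertain
    # Code moving from a more restrictive to a less restrictive license
    # is treated as the most suspicious case in this sketch.
    return 1.0 if o > c else 0.1


def violation_score(pair: ClonePair) -> float:
    # Assumption: the fragment seen earlier is the presumed origin.
    origin, copy = sorted((pair.a, pair.b), key=lambda f: f.first_seen)
    return pair.similarity * violation_weight(origin, copy)


def rank_pairs(pairs: List[ClonePair]) -> List[ClonePair]:
    """Order clone pairs from most to least likely violation."""
    return sorted(pairs, key=violation_score, reverse=True)


# Made-up example data, not taken from the study.
gpl_to_mit = ClonePair(
    Fragment("org/lib", "GPL-3.0", date(2019, 1, 1)),
    Fragment("user/app", "MIT", date(2021, 1, 1)),
    similarity=0.9,
)
mit_to_gpl = ClonePair(
    Fragment("org/util", "MIT", date(2018, 1, 1)),
    Fragment("user/tool", "GPL-3.0", date(2020, 1, 1)),
    similarity=0.95,
)
unlicensed = ClonePair(
    Fragment("org/old", None, date(2017, 1, 1)),
    Fragment("user/new", "MIT", date(2022, 1, 1)),
    similarity=0.8,
)

ranked = rank_pairs([mit_to_gpl, unlicensed, gpl_to_mit])
for pair in ranked:
    print(pair.a.repo, "->", pair.b.repo, round(violation_score(pair), 3))
```

Multiplying similarity by a direction-dependent weight keeps near-identical but legitimately reused fragments (for example, permissively licensed code copied into a copyleft project) low in the list, while pairs with missing license information stay in the middle for manual review.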

The project's repository on GitHub.