Research group

Machine Learning Methods in Software Engineering

GitHub License Violations Study

Timofey Bryksin

Active

This project conducts a comprehensive plagiarism analysis of Java code fragments from GitHub. It consists of three parts: gathering a large (1.5 TB) corpus of Java repositories, searching it for clones (using the approach proposed in our other project), and the analysis itself, which studies plagiarism and license violations in the obtained data. The discovered licenses and the relationships between them are studied in detail, and similar code fragments are ranked by how likely they are to constitute a license violation.
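To illustrate the final ranking step, here is a minimal Python sketch. It is not the project's implementation: the `Fragment` type, the restrictiveness scores, and the scoring rule (an older fragment under a more restrictive license than its newer clone is treated as more suspicious) are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical restrictiveness scores; illustrative only, not from the study.
RESTRICTIVENESS = {"MIT": 0, "Apache-2.0": 0, "LGPL-3.0": 1, "GPL-3.0": 2}
# Assumption: code without any license is treated as the most restrictive.
UNLICENSED_SCORE = 3


@dataclass(frozen=True)
class Fragment:
    repo: str
    license: Optional[str]  # SPDX-style identifier, or None if unlicensed
    last_modified: date


def restrictiveness(license_id: Optional[str]) -> int:
    """Map a license identifier to a score (unknown licenses score 0 here)."""
    if license_id is None:
        return UNLICENSED_SCORE
    return RESTRICTIVENESS.get(license_id, 0)


def violation_score(a: Fragment, b: Fragment) -> int:
    """Score a clone pair: positive when the older fragment's license is
    more restrictive than the newer one's, zero otherwise."""
    older, newer = sorted((a, b), key=lambda f: f.last_modified)
    return max(0, restrictiveness(older.license) - restrictiveness(newer.license))


def rank_pairs(pairs):
    """Return clone pairs ordered from most to least suspicious."""
    return sorted(pairs, key=lambda p: violation_score(*p), reverse=True)
```

For example, a GPL-3.0 fragment from 2015 cloned into an MIT repository in 2018 would rank above an MIT fragment cloned into a GPL-3.0 repository, since under this rule only the first direction counts as potentially problematic.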

The project's repository on GitHub.



A Study of Potential Code Borrowing and License Violations in Java Projects on GitHub

June 2020

Yaroslav Golubev, Maria Eliseeva, Nikita Povarov, Timofey Bryksin
