GitHub License Violations Study
In this project, we conduct a comprehensive analysis of potential plagiarism in Java code fragments from GitHub. The project consists of three parts: gathering a large (1.5 TB) corpus of Java repositories, searching it for clones (using the approach proposed in our other project), and the analysis itself, which studies plagiarism and license violations in the obtained data. The discovered licenses and the relationships between them are studied in detail, and similar code fragments are ranked by how likely they are to constitute a license violation.
The project's repository on GitHub.
A Study of Potential Code Borrowing and License Violations in Java Projects on GitHub
Yaroslav Golubev, Maria Eliseeva, Nikita Povarov, Timofey Bryksin