Machine Learning Methods in Software Engineering
Similar Repositories on GitHub
We developed a tool that searches GitHub for repositories similar to a given set of projects. We pretrained fastText embeddings of sub-tokens on a dataset of 120,000 GitHub projects and used another dataset of 9 million projects as the reference codebase for the search. The tool works in two steps. In the first step, the user provides the target projects (as either local directories or links to GitHub), and the tool detects the programming languages used (the most popular languages are recognized), parses the code, extracts the identifiers, and splits them into sub-tokens. This part of the pipeline is also available as a stand-alone tool.
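The sub-token splitting at the end of the first step can be illustrated with a minimal sketch. It assumes the identifiers have already been extracted by a parser; the function names `split_identifier` and `count_subtokens` and the splitting rules (camelCase, snake_case, digit runs) are hypothetical choices for illustration and are not taken from the tool itself.

```python
import re
from collections import Counter

# Assumed splitting rules (not the tool's actual ones):
#   - a run of capitals not followed by a lowercase letter (acronyms such as "HTTP")
#   - an optional capital followed by lowercase letters ("Response", "parse")
#   - a run of digits
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split one identifier into lowercase sub-tokens."""
    return [m.group(0).lower() for m in _SUBTOKEN.finditer(identifier)]


def count_subtokens(identifiers):
    """Count sub-token occurrences over an iterable of identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts


if __name__ == "__main__":
    print(split_identifier("parseHTTPResponse"))
    print(count_subtokens(["getUserName", "user_id"]))
```

The resulting bag of sub-tokens is the kind of representation that the second step can consume.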
In the second step, the tool calculates the embeddings for these sub-tokens and uses the reference codebase to find the most similar repositories. The output of the tool is also interpretable, because we manually labeled the clusters of sub-tokens that are used for grouping the repositories.
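The idea behind the second step can be sketched as follows. This is a simplified illustration and not the tool's actual algorithm: it assumes each repository is represented by the frequency-weighted average of its sub-token embeddings and that candidates are ranked by cosine similarity. The function names, the toy two-dimensional embeddings, and the reference repository names are all hypothetical.

```python
import math


def repo_vector(subtoken_counts, embeddings):
    """Average the embeddings of a repository's sub-tokens, weighted by count.

    `embeddings` maps sub-token -> list of floats and must be non-empty;
    sub-tokens without an embedding are skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight = 0
    for token, count in subtoken_counts.items():
        vec = embeddings.get(token)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += count * value
        weight += count
    if weight == 0:
        return total  # no known sub-tokens: zero vector
    return [value / weight for value in total]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, reference, top_k=3):
    """Rank reference repositories by similarity to the target vector."""
    scored = [(name, cosine(target, vec)) for name, vec in reference.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy embeddings and reference vectors, invented for this example.
    toy_embeddings = {"http": [1.0, 0.0], "request": [0.9, 0.1], "matrix": [0.0, 1.0]}
    target = repo_vector({"http": 2, "request": 1}, toy_embeddings)
    reference = {"web-lib": [1.0, 0.0], "math-lib": [0.0, 1.0]}
    print(most_similar(target, reference, top_k=2))
```

With 9 million reference projects, a linear scan like this would be slow; an approximate nearest-neighbor index would be a natural replacement, though the text does not say which approach the tool uses.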
The developed tool on GitHub.
The auxiliary extractor of identifiers on GitHub.
Participants
Publications
Sosed: a Tool for Finding Similar Software Projects
September 2020
Egor Bogomolov, Yaroslav Golubev, Artyom Lobanov, Vladimir Kovalenko and Timofey Bryksin