Research group

Machine Learning Methods in Software Engineering

Similar Repositories on GitHub

Project supervisor: Timofey Bryksin
Status: Active

We developed a tool for finding similar repositories on GitHub. We pretrained sub-token embeddings with fastText on a dataset of 120,000 GitHub projects and used a separate dataset of 9 million projects as the reference codebase for search. The tool works in two steps. In the first step, the user provides the target projects (as either directories or links to GitHub), and the tool recognizes the most popular programming languages, parses the code, extracts the identifiers, and splits them into sub-tokens. This part of the pipeline is also available as a stand-alone tool.
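To make the sub-token step concrete, here is a minimal sketch of splitting identifiers on camelCase, snake_case, and digit boundaries and counting the results. It is an illustration only, not the extractor's actual code: the function names `split_identifier` and `count_subtokens` and the splitting rules are assumptions, and the language recognition and parsing stages are left out.

```python
import re
from collections import Counter

# Assumed splitting rules (not taken from the real extractor):
#   - a run of capitals not followed by a lowercase letter ("HTTP"),
#   - an optional capital followed by lowercase letters ("Response", "parse"),
#   - a run of digits ("2").
# Underscores and other separators simply never match and are dropped.
_SUBTOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split one identifier into lowercase sub-tokens."""
    return [m.group(0).lower() for m in _SUBTOKEN_RE.finditer(identifier)]


def count_subtokens(identifiers):
    """Count sub-token occurrences over all identifiers of a project."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts
```

For example, `split_identifier("parseHTTPResponse")` yields the sub-tokens `parse`, `http`, and `response`, and `snake_case_name` is split at the underscores.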

In the second step, the tool computes embeddings for these sub-tokens and searches the reference codebase for the most similar repositories. The output of the tool is also interpretable, because we manually labeled the clusters of sub-tokens that are used for grouping the repositories.
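The search step can be illustrated with a small sketch. The description above does not say how sub-token embeddings are combined into a repository representation or which similarity measure is used, so this example assumes a count-weighted average of sub-token vectors and cosine similarity. The names `repo_vector`, `cosine`, and `most_similar`, and the toy two-dimensional vectors in the usage note, are hypothetical; a real setup would load pretrained fastText vectors and precompute vectors for the reference repositories.

```python
import math


def repo_vector(subtoken_counts, embeddings):
    """Count-weighted average of sub-token vectors (an assumed aggregation).

    Sub-tokens without a vector are skipped in this toy version.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight = 0
    for token, count in subtoken_counts.items():
        vector = embeddings.get(token)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += count * value
        weight += count
    if weight == 0:
        return total  # all-zero vector: no known sub-tokens
    return [x / weight for x in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, reference, top_k=3):
    """Rank reference repositories by similarity to the target vector."""
    scored = [(name, cosine(target, vec)) for name, vec in reference.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With toy embeddings `{"http": [1.0, 0.0], "tensor": [0.0, 1.0]}`, a project whose sub-token counts are `{"http": 3, "tensor": 1}` ranks a reference repository with vector `[1.0, 0.0]` above one with vector `[0.0, 1.0]`.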

The tool is available on GitHub.

The stand-alone identifier extractor is also available on GitHub.