Similar Repositories on GitHub
We developed a tool for finding similar repositories on GitHub. We pretrained sub-token embeddings with fastText on a dataset of 120,000 GitHub projects and used a separate dataset of 9 million projects as the reference codebase for search. The tool works in two steps. In the first step, the user supplies the target projects (as either local directories or links to GitHub), and the tool recognizes the most popular programming languages, parses the code, extracts the identifiers, and splits them into sub-tokens. This part of the pipeline is also available as a stand-alone tool.
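To make the sub-token step concrete, here is a minimal sketch of how identifiers might be split and counted. The splitting rules (underscores, camelCase boundaries, digit runs) and the function names `split_identifier` and `count_subtokens` are assumptions made for illustration; the actual extractor parses the source code properly and may use different rules.

```python
import re
from collections import Counter

# Illustrative splitting rule (an assumption, not the tool's exact logic):
# an acronym run, a capitalized or lowercase word, or a run of digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split one identifier into lowercase sub-tokens."""
    return [m.group(0).lower() for m in _PART.finditer(identifier)]


def count_subtokens(identifiers):
    """Count sub-tokens over an iterable of identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts


if __name__ == "__main__":
    print(split_identifier("parseHTTPResponse"))
    print(count_subtokens(["readFile", "file_path", "file_path"]))
```

The resulting bag of sub-tokens is the kind of input the second step would consume.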
In the second step, the tool calculates the embeddings for these sub-tokens and uses the reference codebase to discover the most similar repositories. The output of the tool is also interpretable, because we manually labeled the clusters of sub-tokens that are used for grouping the repositories.
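One way such a search could work is sketched below, under stated assumptions: each repository is represented by the count-weighted average of its sub-token vectors, and candidates are ranked by cosine similarity. This is a simplified stand-in and not necessarily the tool's actual method, which groups repositories using clusters of sub-tokens. The toy two-dimensional embeddings, the repository names, and the helpers `repo_vector`, `cosine`, and `most_similar` are all hypothetical.

```python
import math


def repo_vector(subtoken_counts, embeddings):
    """Count-weighted average of the sub-token vectors of one repository."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight = 0
    for token, count in subtoken_counts.items():
        vec = embeddings.get(token)
        if vec is None:  # skip sub-tokens without an embedding
            continue
        for i, value in enumerate(vec):
            total[i] += count * value
        weight += count
    if weight == 0:
        return total
    return [x / weight for x in total]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, reference, top_k=3):
    """Rank reference repositories by cosine similarity to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in reference.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for pretrained fastText vectors (hypothetical).
EMBEDDINGS = {
    "http": [1.0, 0.0],
    "request": [0.9, 0.1],
    "matrix": [0.0, 1.0],
    "tensor": [0.1, 0.9],
}

# Toy reference codebase (hypothetical repository names).
REFERENCE = {
    "web-client": repo_vector({"http": 3, "request": 2}, EMBEDDINGS),
    "linear-algebra": repo_vector({"matrix": 4, "tensor": 1}, EMBEDDINGS),
}

if __name__ == "__main__":
    query = repo_vector({"request": 5, "http": 1}, EMBEDDINGS)
    for name, score in most_similar(query, REFERENCE):
        print(f"{name}: {score:.3f}")
```

At the scale of millions of reference repositories, an exhaustive scan like this would likely be replaced by an approximate nearest-neighbor index.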
The developed tool on GitHub.
The auxiliary extractor of identifiers on GitHub.