JetBrains Research: science that changes the world

BioLabs summer student internships 2019

Internship results

Each year JetBrains offers internships for talented and motivated students, giving them the chance to work as full-time developers on challenging tasks and projects. Over two months, interns work closely with teams, learn new technologies, and get hands-on experience from professionals in fields ranging from software development and machine learning to design and documentation writing.

The JetBrains Research BioLabs team regularly works with students from the Computer Science Center (https://compscicenter.ru), the Bioinformatics Institute (https://bioinf.me), and ITMO University (https://en.itmo.ru/en/), and this year was no exception.

In 2019 the BioLabs team offered four projects for summer internships:

  1. ARM using Fishbone diagrams
  2. Publications analysis
  3. Snakecharm plugin
  4. SPAN model improvement

ARM using Fishbone diagrams

Association Rule Mining (ARM) is a data mining technique for discovering hidden dependencies in observational data. Fishbone ARM is a novel approach that combines bottom-up rule mining with information theory and multiple testing to produce statistically significant and interpretable results. Visualization with Ishikawa (fishbone) diagrams helps in understanding the resulting relationships and provides rich capabilities for data filtering.
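
As a rough illustration of what an association rule is, the sketch below computes three standard rule statistics (support, confidence, and lift) for a single rule over a toy set of observations. This is not the Fishbone ARM algorithm: the function name, the feature labels, and the data are hypothetical, and the information-theoretic and multiple-testing steps are omitted.

```python
def rule_metrics(observations, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(observations)
    # Count observations containing the antecedent, the consequent, and both.
    n_a = sum(1 for obs in observations if antecedent <= obs)
    n_c = sum(1 for obs in observations if consequent <= obs)
    n_ac = sum(1 for obs in observations if (antecedent | consequent) <= obs)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift compares the confidence with the baseline frequency of the consequent.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical toy data: each observation is the set of features present in it.
observations = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
]

support, confidence, lift = rule_metrics(observations, {"A"}, {"B"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

In this toy data the rule A -> B holds in half of the observations, and its lift is below 1, meaning A and B co-occur slightly less often than they would if they were independent.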

"ARM using Fishbone diagrams" is a follow-up project done by students at the Bioinformatics Institute. The student tasks were:

  1. Investigate and improve algorithms
  2. Improve the visualization web service
  3. Evaluate algorithm on Ciofani dataset

Figure 1 shows the web interface of the service.

[Figure 1: web interface of the Fishbone ARM service]

Daria Likholetova and Nina Lukashina worked on this project during the summer internship under the mentorship of Peter Tsurinov. Daria worked on the biological interpretation of the algorithm's results as well as on data preparation and analysis. Nina focused mostly on algorithms and web service development. They achieved the following results:

  • Improved the algorithm by adding the LOE criterion, a measure of interestingness, and a statistical significance check
  • Improved service usability with a better UI
  • Successfully validated the new approach on the Ciofani dataset
  • The resulting rules have a reasonable biological meaning

These findings suggest that the method can produce novel biological knowledge from observational data and provide rich visualization and analytic capabilities.

The full project presentation is available at https://docs.google.com/presentation/d/1PIJNECrEmm_4x3svyt8EHEFEvetZxUFq65Ffq0K5zGI/edit?usp=sharing


Publications analysis

The publication analysis service is an exploratory tool for researchers that speeds up trend analysis and the discovery of breakthrough papers in the steadily growing worldwide flow of publications. The service aims to solve two tasks: exploring popular trends in publications on a given topic and helping to find new promising directions. At the moment the service incorporates the PubMed database and the Semantic Scholar archive. Semantic Scholar aggregates major journals and publishers, including Springer Nature, ACM, etc. Together, the two contain 45 million papers and 170 million references.

"Publications analysis" is a follow-up project done by students at the Computer Science Center. It combines graph analysis of the citation network with time series analysis.

Student tasks were:

  1. Support Semantic Scholar as a data source
  2. Improve search by using PostgreSQL text search
  3. Reorganize web UI
  4. Add zoom functionality and support dedicated paper analytics
  5. Work on topic evolution and paper impact predictions
  6. Make the first alpha release

Anna Vlasova and Nickolay Kapralov worked on this project under the mentorship of Oleg Shpynov. They successfully completed most of these tasks.

Figure 2 shows the results page of the web service.

[Figure 2: results page of the publications analysis web service]

It contains a summary overview of the field as well as detailed information about the total number of publications, topics and trends, authors and journals, and word clouds for each subject. The list of top cited papers lets a user get acquainted with a new area quickly and efficiently, cutting through the clutter. The trends section shows well-established and fast-growing directions in the field.
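
One simple way to combine a citation graph with time series is to count, for each paper, how many citations it receives per year. The sketch below does this for a toy graph. The paper identifiers, the years, and the function name are hypothetical and do not reflect the service's actual data model or code.

```python
from collections import Counter, defaultdict


def citations_per_year(paper_years, citations):
    """Map each cited paper to a Counter {year: citations received that year}.

    paper_years: dict of paper id -> publication year.
    citations: iterable of (citing_id, cited_id) edges of the citation graph.
    """
    series = defaultdict(Counter)
    for citing, cited in citations:
        # A citation is dated by the publication year of the citing paper.
        series[cited][paper_years[citing]] += 1
    return series


# Hypothetical toy citation graph.
paper_years = {"p1": 2015, "p2": 2016, "p3": 2017, "p4": 2017}
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p1"), ("p4", "p2")]

series = citations_per_year(paper_years, citations)

# Total citations (in-degree in the graph) give a simple "top cited" ranking.
in_degree = {paper: sum(counts.values()) for paper, counts in series.items()}
top_cited = sorted(in_degree, key=in_degree.get, reverse=True)

for paper in top_cited:
    print(paper, in_degree[paper], sorted(series[paper].items()))
```

The per-year counts form one time series per paper, and the totals give a ranking by citations. A real system would need considerably more than this to detect topics and trends.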

Experiments suggest that it is possible to predict the scientific impact of a paper based on a limited amount of information about it. We will address this question in a follow-up project.

Full project presentation is available at https://docs.google.com/presentation/d/131qvkEnzzmpx7-I0rz1om6TG7bMBtYwU9T1JNteRIEs/edit?usp=sharing



Snakecharm plugin

Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to computer clusters without modifying the workflow. It is the first system to support the use of automatically inferred multiple named wildcards (or variables) in input and output filenames.
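
To make the idea of inferred named wildcards concrete, the sketch below shows one way a filename pattern with {name} placeholders can be matched against a concrete path to recover the wildcard values. It is a simplified stand-in written with the standard re module, not Snakemake's implementation. The function names and example paths are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'dir/{name}.txt' into a regex with one named group per wildcard."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    seen = set()
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)       # literal text between wildcards
        elif part in seen:
            regex += f"(?P={part})"        # repeated wildcard must match the same value
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"     # first occurrence: any non-empty string
    return re.compile(regex)


def infer_wildcards(pattern, filename):
    """Return a dict of wildcard values, or None if the filename does not fit."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    return match.groupdict() if match else None


print(infer_wildcards("mapped/{sample}.{genome}.bam", "mapped/A1.hg19.bam"))
print(infer_wildcards("mapped/{sample}.bam", "raw/A1.fastq"))
```

The first call recovers the sample and genome values from the path, and the second returns None because the path does not fit the pattern.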

Snakecharm is a plugin for PyCharm and other IntelliJ Platform-based IDEs that provides rich editing and refactoring capabilities for the Snakemake language. Deep understanding of the code allows the IDE editor to guide you through all the steps of pipeline development. Smart refactorings enable you to improve and modify your project safely and efficiently. Built-in inspections highlight possible flaws and errors, significantly reducing the number of runtime problems.

"Snakecharm plugin" is a follow-up project done during the JetBrains student practice, fall 2019. Daria Sharkova and Nikita Nazarov worked for two months under the mentorship of Roman Chernyatchik on the following tasks:

  1. Support highlighting, autocompletion and resolving for Snakemake string-format language
  2. Improve parsing to comply with Snakemake language specification
  3. Autocompletion and resolving for wildcards and section names in rules
  4. Subworkflows syntax highlighting
  5. Different inspections based on Snakemake semantics

Figure 3 shows a screenshot of the PyCharm IDE with Snakemake code.

[Figure 3: PyCharm IDE with Snakemake code]

The plugin is released and available in the official plugin repository at https://plugins.jetbrains.com/plugin/11947-snakech...

The full presentation is available at https://docs.google.com/presentation/d/13TvCId2d8YaHC4t4LLmolxf8p_qg76pea16-z53QcX4/edit?usp=sharing


SPAN model improvement

SPAN is a Semi-supervised Peak Analyzer for ChIP-seq data. ChIP-seq is a powerful biological method for determining where proteins bind to DNA. Ultra-Low Input (ULI) ChIP-seq is an experimental modification of this method that requires ten times less cellular material. In the experiment "Multiomics dissection of healthy human aging" (https://artyomovlab.wustl.edu/aging/) we used ULI ChIP-seq to systematically evaluate different epigenetic signals for each person.

We developed a novel semi-supervised approach to peak calling. SPAN is a fast and effective multipurpose peak caller capable of processing both conventional and ULI ChIP-seq tracks. In the semi-supervised method, the user annotates a handful of locations as peaks, valleys, or peak shores, and then uses these annotations to train a model that is optimal for the given sample.

SPAN uses Hidden Markov Models with zero and negative binomial emissions to model and detect enrichment in the raw data signal. Other publications propose various ways to use DNA sequence content to improve prediction quality, and we tried to implement some of them.
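
To give a feel for this kind of model, here is a minimal sketch of a three-state hidden Markov model (a zero state plus low and high negative binomial states) decoded with the Viterbi algorithm over a short vector of read counts. It is an illustration under stated assumptions, not SPAN's code: the state layout, all parameter values, and the counts are hypothetical, and no parameter fitting is performed.

```python
import math


def nb_logpmf(x, mean, r):
    """Log-probability of count x under a negative binomial with the given mean and dispersion r."""
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(r / (r + mean)) + x * math.log(mean / (r + mean)))


def zero_logpmf(x):
    """Emission of the zero state: it can only produce a count of 0."""
    return 0.0 if x == 0 else float("-inf")


def viterbi(counts, log_start, log_trans, log_emit):
    """Return the most likely state index sequence for the observed counts."""
    n_states = len(log_start)
    score = [log_start[s] + log_emit[s](counts[0]) for s in range(n_states)]
    back = []
    for x in counts[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor of state s at this position.
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[s](x))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters, chosen only for illustration.
STATES = ["zero", "low", "high"]
log_emit = [
    zero_logpmf,
    lambda x: nb_logpmf(x, 1.0, 2.0),   # background signal
    lambda x: nb_logpmf(x, 20.0, 2.0),  # enriched signal
]
log_start = [math.log(1 / 3)] * 3
log_trans = [[math.log(0.8 if i == j else 0.1) for j in range(3)] for i in range(3)]

counts = [0, 0, 1, 0, 25, 30, 18, 0, 0]  # toy read counts per genomic bin
path = [STATES[s] for s in viterbi(counts, log_start, log_trans, log_emit)]
print(list(zip(counts, path)))
```

With these toy parameters the three large counts are assigned to the high state, which is the kind of region a peak caller would report as enriched.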

Internship tasks were:

  1. Develop code for fitting a generalized linear model in Kotlin
  2. Improve performance of Apache Commons Math library
  3. Use control, GC-content and genomic mappability as covariates
  4. Evaluate this approach on existing ULI ChIP-seq dataset
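
As a hedged illustration of the first and third tasks, the sketch below fits a small generalized linear model with a single covariate by Newton's method. It assumes a Poisson family with a log link purely for brevity; the source does not say which family or fitting procedure the intern used, and the actual work was done in Kotlin. The function name, the covariate, and the synthetic data are all hypothetical.

```python
import math


def fit_poisson_glm(x, y, iterations=25):
    """Fit log(mean) = b0 + b1 * x by Newton's method; returns (b0, b1)."""
    # Start from the intercept-only model.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood: X^T (y - mu).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information X^T diag(mu) X, a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic, noise-free data: expected coverage as a function of a GC-content covariate.
gc_content = [0.30, 0.40, 0.50, 0.60, 0.70]
coverage = [math.exp(1.0 + 2.0 * g) for g in gc_content]

intercept, slope = fit_poisson_glm(gc_content, coverage)
print(f"intercept={intercept:.4f} slope={slope:.4f}")
```

Because the synthetic data are generated without noise from known coefficients, the fit should recover them closely. Real coverage data would be noisy and would likely call for an overdispersed family and several covariates.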

"SPAN model improvement" is a follow-up project done by students at the Bioinformatics Institute. Elena Kartysheva worked on these tasks for two months under the mentorship of Aleksei Dievskii. The intern successfully implemented a new SPAN model and compared it with the previous baseline. Although comparing computational models in an unsupervised setting is difficult, the proposed approach can improve SPAN performance.

The full presentation is available here: https://drive.google.com/file/d/1sP9cnEcvdo-cerXVAbribWfLmQvlAA0C/view?usp=sharing


It was very inspiring for us as mentors to work with such motivated and highly skilled students, and we look forward to new practices and internships.

--

Oleg Shpynov,
JetBrains Research BioLabs