Research group



Paper-analyzer is a web-based application that runs search queries over a collection of 30 million PubMed abstracts. A query includes a gene (identified by its NCBI Gene ID) and/or a MeSH (Medical Subject Headings) term. Users can also specify a taxon name for a gene, and enter text context or an author name to narrow the search.

The search looks for relations between the entities named in the query. At present, we can find connections between genes and diseases, chemicals and genes, and chemicals and diseases. There are 13 types of relations, such as marker/mechanism, therapeutic, increased or decreased expression, activity, or metabolic processing, and so on. In many cases an abstract does not state explicitly that the entities in question are related. To address this, we trained a natural language understanding model based on BERT, a Transformer architecture, using positive relation examples from the Comparative Toxicogenomics Database (CTD).

We use PubTator for named entity recognition (NER) and entity name normalization, but we plan to replace it with our own NER system shortly. We also plan to add gene-gene relations from Reactome to our search system.

We preprocess all abstracts in advance and store the extracted relations in a database.

Relations in a database

After submitting a query, the user gets a list of resulting papers, aggregated by relation endpoints and types. One can collapse relation types and sort search results by score (model confidence), publication year, or the number of papers in a group.
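The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the application's actual code: the field names mirror the columns of the extracted relations database described further down this page, the `Year` field and both function names are assumptions, and the records are made up.

```python
from collections import defaultdict

def group_papers(records):
    """Group relation records by their endpoints and relation type."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["IdFrom"], rec["IdTo"], rec["Relation"])
        groups[key].append(rec)
    return groups

def summarize(groups, sort_by="score"):
    """Build one summary row per group and sort rows in descending order."""
    rows = []
    for (id_from, id_to, relation), recs in groups.items():
        rows.append({
            "IdFrom": id_from,
            "IdTo": id_to,
            "Relation": relation,
            "score": max(r["Prob"] for r in recs),     # best model confidence
            "year": max(r["Year"] for r in recs),      # most recent publication
            "papers": len({r["PMID"] for r in recs}),  # distinct papers in group
        })
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)

# Made-up records with placeholder identifiers, for illustration only.
records = [
    {"IdFrom": "CHEM_A", "IdTo": "GENE_X", "Relation": "chem_gene_increases_expression",
     "PMID": "000001", "Prob": 0.91, "Year": 2015},
    {"IdFrom": "CHEM_A", "IdTo": "GENE_X", "Relation": "chem_gene_increases_expression",
     "PMID": "000002", "Prob": 0.75, "Year": 2019},
    {"IdFrom": "CHEM_A", "IdTo": "GENE_X", "Relation": "chem_gene_decreases_activity",
     "PMID": "000003", "Prob": 0.97, "Year": 2012},
]

by_papers = summarize(group_papers(records), sort_by="papers")
by_score = summarize(group_papers(records), sort_by="score")
```

Here `by_papers` lists the two-paper group first, while `by_score` lists the group with the highest model confidence first. Taking the maximum score and year within a group is one possible choice; the application may summarize groups differently.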

List of resulting papers

Users can explore the search results at the level of individual abstracts by selecting papers grouped by relation type. One can filter abstracts by publication year using a histogram. We also provide detailed information about each paper and links to PubMed and PubTator.

Search result

We are currently working on extracting additional information about entities and relations from text. For now, one can see the contexts found in sentences that contain both entities forming a relation.

Additional information about entities and relations from text

Extracted Relations Database

Applying the relation extraction (RE) model to PubMed abstracts produced a database of extracted relations. We will update this database whenever the model changes.

The RE database is a TSV file with the following columns:

  • ‘NameFrom’ - relation tail entity name;
  • ‘IdFrom’ - relation tail NCBI/MeSH id;
  • ‘GroupFrom’ - relation tail group name (chemical, disease, gene);
  • ‘NameTo’ - relation head entity name;
  • ‘IdTo’ - relation head NCBI/MeSH id;
  • ‘GroupTo’ - relation head group name (chemical, disease, gene);
  • ‘Relation’ - relation type (see below the list of relation types);
  • ‘PMID’ - PMID of the paper, where the relation is found;
  • ‘Prob’ - model confidence in the extracted relation.
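As a minimal sketch of how such a file could be read, the Python snippet below parses tab-separated rows and keeps only relations above a confidence threshold. It assumes the file starts with a header row carrying exactly the column names listed above, which this page does not confirm; the function name `load_relations` and the sample rows are made up for illustration.

```python
import csv
import io

def load_relations(handle, min_prob=0.0):
    """Read relation rows from a TSV handle, keeping those with Prob >= min_prob."""
    # QUOTE_NONE treats quote characters in entity names as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["Prob"] = float(row["Prob"])  # convert the score from text to a number
        if row["Prob"] >= min_prob:
            rows.append(row)
    return rows

# Made-up sample with placeholder names and identifiers (assumed header row).
sample = io.StringIO(
    "NameFrom\tIdFrom\tGroupFrom\tNameTo\tIdTo\tGroupTo\tRelation\tPMID\tProb\n"
    "chemical A\tCHEM_A\tchemical\tgene X\tGENE_X\tgene\t"
    "chem_gene_increases_expression\t000001\t0.93\n"
    "chemical B\tCHEM_B\tchemical\tdisease Y\tDIS_Y\tdisease\t"
    "chem_disease_therapeutic\t000002\t0.41\n"
)

confident = load_relations(sample, min_prob=0.9)
```

For the real file, the `io.StringIO` sample would be replaced by an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded TSV.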

We consider the following relation types:

  • chem_disease_marker/mechanism
  • chem_disease_therapeutic
  • chem_gene_affects_response_to_substance
  • chem_gene_affects_transport
  • chem_gene_decreases_activity
  • chem_gene_decreases_expression
  • chem_gene_decreases_metabolic_processing
  • chem_gene_decreases_reaction
  • chem_gene_increases_activity
  • chem_gene_increases_expression
  • chem_gene_increases_metabolic_processing
  • chem_gene_increases_reaction
  • gene_disease_marker/mechanism
  • gene_disease_therapeutic

These types are a subset of the relation types defined in the CTD.
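The type names in the list above appear to follow the pattern source group, target group, relation, joined by underscores. Under that assumption (inferred from the list, not stated by the authors), a name can be split into its parts as follows; the function name is hypothetical.

```python
def parse_relation_type(name):
    """Split a relation type name into (source group, target group, relation).

    Assumes the first two underscore-separated tokens are the entity groups
    and everything after them is the relation itself.
    """
    source, target, relation = name.split("_", 2)
    return source, target, relation

parsed = [
    parse_relation_type("chem_disease_marker/mechanism"),
    parse_relation_type("chem_gene_affects_response_to_substance"),
    parse_relation_type("gene_disease_therapeutic"),
]
```

Limiting the split to two underscores keeps multi-word relations such as `affects_response_to_substance` intact.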

Release of June 18, 2020

Release description:

  • Improved performance of relation extraction with BioBERT.
  • Added calibration of relation probabilities using isotonic regression. The calibrated probability is added as an extra column, cProb. A calibrated probability of 0.9 now indicates that about 9 out of 10 such relations are expected to be true.
  • Changed the relation types to the highest CTD parent:
    • CHEMICAL-DISEASE: therapeutic, marker/mechanism
    • GENE-DISEASE: therapeutic, marker/mechanism
    • CHEMICAL-GENE: expression, reaction, metabolic processing, activity, binding, response to substance, cotreatment, transport, therapeutic, localization
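To illustrate what calibration with isotonic regression does, here is a generic sketch of the pool-adjacent-violators algorithm in plain Python. The release notes do not describe how the calibration was actually fitted, so this is not the authors' implementation; the function names and the toy scores and labels are assumptions, and in practice a library implementation such as scikit-learn's `IsotonicRegression` would normally be used.

```python
from bisect import bisect_right

def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed label rates."""
    # Pool identical scores first so that ties share one value.
    pooled = {}
    for score, label in zip(scores, labels):
        total, count = pooled.get(score, (0.0, 0))
        pooled[score] = (total + label, count + 1)

    blocks = []  # each block: [label_sum, count, lowest_score_in_block]
    for score in sorted(pooled):
        total, count = pooled[score]
        blocks.append([total, count, score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            last_total, last_count, _ = blocks.pop()
            blocks[-1][0] += last_total
            blocks[-1][1] += last_count

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values

def calibrate(score, thresholds, values):
    """Map a raw score to the value of the block whose range it falls in."""
    index = bisect_right(thresholds, score) - 1
    return values[max(index, 0)]  # scores below the first threshold use the first block

# Toy data: raw model scores and whether each relation was actually true (1) or not (0).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
is_true = [0, 0, 1, 0, 1, 1, 0, 1]

thresholds, values = fit_isotonic(raw_scores, is_true)
calibrated = calibrate(0.45, thresholds, values)
```

The fitted values never decrease as the raw score grows, so a higher raw score never receives a lower calibrated probability. Each value is the fraction of true relations among the pooled examples, which is what lets a calibrated probability be read as an expected rate of correct relations.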

The link

Release of April 16, 2020

Release description: the first public release of the database. The link.