Paper-analyzer is a web-based application that runs search queries over a collection of 30 million PubMed paper abstracts. A search query includes a gene (identified by its NCBI gene ID) and/or a MeSH (Medical Subject Headings) term. Users can also specify taxon names for a gene and enter a text context and an author name to narrow the search.
We trained a model to find relations between the entities that appear in a search query. Currently, we can find connections between genes and diseases, chemicals and genes, and chemicals and diseases. There are 13 types of relations, such as marker/mechanism, therapeutic, increase or decrease of expression, activity, or metabolic processing, and so on. In many cases the abstract does not contain an explicit statement that a relationship exists between the entities in question. To address this problem, we trained a natural language understanding model based on BERT, a Transformer architecture. We took positive relation examples from the Comparative Toxicogenomics Database (CTD) to train the model.
We use the PubTator application for named entity recognition and entity name normalization, but we plan to replace it with our own NER system shortly. We also plan to add gene-gene relations from Reactome to our search system.
We preprocess all abstracts in advance and store the extracted relations in a database.
After submitting a query, the user gets a list of resulting papers, aggregated by relation endpoints and relation types. One can collapse relation types and sort the search results by score (model confidence), publication year, or the number of papers in a group.
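The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the application's actual code: the field names are borrowed from the columns of the extracted relations database described later in this document, the rows are made-up placeholders, and using the highest probability in a group as the group score is an assumption.

```python
from collections import defaultdict

# Illustrative rows only; identifiers, PMIDs, and scores are placeholders, not real data.
ROWS = [
    {"IdFrom": "CHEM_A", "IdTo": "DIS_X", "Relation": "therapeutic", "PMID": "1", "Prob": 0.95},
    {"IdFrom": "CHEM_A", "IdTo": "DIS_X", "Relation": "therapeutic", "PMID": "2", "Prob": 0.80},
    {"IdFrom": "CHEM_A", "IdTo": "DIS_X", "Relation": "marker/mechanism", "PMID": "3", "Prob": 0.99},
]


def group_relations(rows):
    """Group rows by relation endpoints and relation type."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["IdFrom"], row["IdTo"], row["Relation"])].append(row)

    summary = []
    for (id_from, id_to, relation), members in buckets.items():
        summary.append({
            "IdFrom": id_from,
            "IdTo": id_to,
            "Relation": relation,
            # Count distinct papers supporting this relation.
            "papers": len({m["PMID"] for m in members}),
            # Assumption: the group score is the best single-paper score.
            "score": max(m["Prob"] for m in members),
        })
    return summary


def sort_groups(summary, by="score"):
    """Sort groups in descending order by 'score' or 'papers'."""
    return sorted(summary, key=lambda group: group[by], reverse=True)


groups = group_relations(ROWS)
for group in sort_groups(groups, by="papers"):
    print(group["Relation"], group["papers"], group["score"])
```

Sorting by publication year would work the same way once a year field is attached to each row; the sketch omits it because the database columns do not include one.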
Users can explore the search results at the level of individual abstracts by selecting papers grouped by relation type. One can filter abstracts by publication year using a histogram. We also provide detailed information about each paper and links to PubMed and PubTator.
We are currently working on extracting additional information about entities and relations from the text. For now, one can see the contexts found in sentences that contain both entities of a relation.
Extracted Relations Database
By applying the relation extraction (RE) model to PubMed abstracts, we obtained a database of extracted relations. We will update this database whenever the model changes.
The RE database is a TSV file with the following columns:
- ‘NameFrom’ - relation tail entity name;
- ‘IdFrom’ - relation tail NCBI/MESH id;
- ‘GroupFrom’ - relation tail group name (chemical, disease, gene);
- ‘NameTo’ - relation head entity name;
- ‘IdTo’ - relation head NCBI/MESH id;
- ‘GroupTo’ - relation head group name (chemical, disease, gene);
- ‘Relation’ - relation type (see below the list of relation types);
- ‘PMID’ - PMID of the paper, where the relation is found;
- ‘Prob’ - model confidence in the extracted relation.
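As a rough illustration of how a file with these columns might be read, here is a minimal Python sketch using only the standard library. It assumes that the file starts with a header row containing the column names above; the sample rows, identifiers, and the `load_relations` helper are hypothetical and serve only to make the example self-contained.

```python
import csv
import io

# Hypothetical sample in the column layout listed above; all values are placeholders.
SAMPLE_TSV = (
    "NameFrom\tIdFrom\tGroupFrom\tNameTo\tIdTo\tGroupTo\tRelation\tPMID\tProb\n"
    "chem A\tCHEM_A\tchemical\tdisease X\tDIS_X\tdisease\ttherapeutic\t1\t0.97\n"
    "gene B\tGENE_B\tgene\tdisease X\tDIS_X\tdisease\tmarker/mechanism\t2\t0.42\n"
    "chem A\tCHEM_A\tchemical\tgene B\tGENE_B\tgene\texpression\t3\t0.91\n"
)


def load_relations(handle, min_prob=0.0, relation=None):
    """Read relation rows, keeping those above a probability threshold.

    Assumes the first line of the file is a header with the column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        prob = float(row["Prob"])
        if prob < min_prob:
            continue
        if relation is not None and row["Relation"] != relation:
            continue
        row["Prob"] = prob  # store the parsed number instead of the raw string
        kept.append(row)
    return kept


confident = load_relations(io.StringIO(SAMPLE_TSV), min_prob=0.9)
for row in confident:
    print(row["NameFrom"], row["Relation"], row["NameTo"], row["PMID"], row["Prob"])
```

For the real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.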
We consider the following classes of relation types:
These types are a subset of the types defined in the CTD database.
Release of June 18, 2020
- Improved performance of relation extraction with BioBERT.
- Added calibration of relation probabilities using isotonic regression. The calibrated probability is stored in an extra column, cProb. A calibrated probability of 0.9 now indicates that about 9 out of 10 such relations are expected to be true.
- Changed the relation types to the highest CTD parent:
  - CHEMICAL-DISEASE: therapeutic, marker/mechanism
  - GENE-DISEASE: therapeutic, marker/mechanism
  - CHEMICAL-GENE: expression, reaction, metabolic processing, activity, binding, response to substance, cotreatment, transport, therapeutic, localization
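To make the calibration step mentioned in this release more concrete, the sketch below fits an isotonic (non-decreasing) mapping from raw model scores to observed frequencies using the pool-adjacent-violators algorithm. It is a self-contained illustration under stated assumptions, not the code used to produce cProb: the function names, the toy scores and labels, and the choice of a step function for prediction are all assumptions.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    Returns (uppers, values): uppers[i] is the highest raw score covered by
    block i, and values[i] is the calibrated probability for that block.
    """
    # Aggregate tied scores so each distinct score starts as one block.
    totals = {}
    for score, label in zip(scores, labels):
        acc = totals.setdefault(score, [0.0, 0])
        acc[0] += label
        acc[1] += 1

    blocks = []  # each block: [label sum, count, highest score covered]
    for score in sorted(totals):
        blocks.append([totals[score][0], totals[score][1], score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    uppers = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return uppers, values


def calibrate(prob, uppers, values):
    """Map a raw score to the value of the first block whose upper bound covers it."""
    index = min(bisect_left(uppers, prob), len(values) - 1)
    return values[index]


# Toy held-out data: raw scores with 1 = relation judged true, 0 = false.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
is_true = [0, 1, 0, 1, 1, 1]
uppers, values = fit_isotonic(raw_scores, is_true)
print(calibrate(0.25, uppers, values))
```

In the toy data the labels at scores 0.2 and 0.3 are out of order, so the algorithm pools them into one block whose calibrated value is their average. Library implementations of isotonic regression typically interpolate between blocks rather than returning a step function.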
Release of April 16, 2020
Release description: the first public release of the database. The link.