Paper-analyzer is a web-based application that performs search queries on a collection of 30 million PubMed paper abstracts. A search query includes an NCBI gene ID and/or a MeSH (Medical Subject Headings) term. Users can also specify taxon names for genes and add context and author names to narrow down the search.
We trained a model to find relations between the entities that appear in a search query. Currently, we can find connections between genes and diseases, chemicals and genes, and chemicals and diseases. There are 13 types of relations, such as marker/mechanism relations, therapeutic effect, increase or decrease in expression, activity, or metabolic processing, and so on. In many cases, abstracts do not explicitly state that a relationship exists between the entities in question. To address this problem, we trained a Natural Language Understanding model based on BERT, a Transformer architecture. We took positive relation examples from the Comparative Toxicogenomics Database (CTD) to train the model.
We used the PubTator application for named entity recognition (NER) and entity name normalization, but we plan to replace it with our own NER system shortly. We also plan to include gene-gene relations from Reactome in our search system.
We preprocess all the abstracts and store relations in a database.
After submitting the query, the user gets a list of resulting papers aggregated by relation endpoints and types. One can collapse relation types and sort the search results by score (model confidence), publication year, or number of papers in a group.
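The grouping and sorting described above can be sketched roughly as follows. This is only an illustration, not the application's actual code: the field names are borrowed from the relations database columns listed further down, the sample rows are made up, and taking the maximum probability as the group score is an assumption (the source does not say how a group's score is computed).

```python
from collections import defaultdict


def group_relations(rows):
    """Group rows by relation endpoints and type: (IdFrom, IdTo, Relation)."""
    groups = defaultdict(lambda: {"pmids": set(), "score": 0.0})
    for row in rows:
        key = (row["IdFrom"], row["IdTo"], row["Relation"])
        groups[key]["pmids"].add(row["PMID"])
        # Assumption: a group's score is the best score among its papers.
        groups[key]["score"] = max(groups[key]["score"], row["Prob"])
    return dict(groups)


def sort_groups(groups, by="score"):
    """Return (key, group) pairs sorted in descending order by 'score' or 'papers'."""
    key_functions = {
        "score": lambda item: item[1]["score"],
        "papers": lambda item: len(item[1]["pmids"]),
    }
    return sorted(groups.items(), key=key_functions[by], reverse=True)


# Made-up rows with placeholder identifiers, for illustration only.
rows = [
    {"IdFrom": "C1", "IdTo": "G1", "Relation": "expression", "PMID": "1", "Prob": 0.70},
    {"IdFrom": "C1", "IdTo": "G1", "Relation": "expression", "PMID": "2", "Prob": 0.90},
    {"IdFrom": "C1", "IdTo": "D1", "Relation": "therapeutic", "PMID": "3", "Prob": 0.95},
]

groups = group_relations(rows)
by_score = sort_groups(groups, by="score")
by_papers = sort_groups(groups, by="papers")
```

Sorting by publication year would follow the same pattern if each row also carried a year field.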
Users can explore search results at the level of particular abstracts by selecting papers grouped by relation types. One can filter abstracts by publication year using the histogram. We also provide detailed information about the papers and links to PubMed and PubTator.
We are now working on extracting additional information about entities and relations from article text. For now, one can see the contexts found in sentences that contain both entities forming a relation.
Extracted Relations Database
Applying the Relation Extraction (RE) model to PubMed abstracts produced a database of extracted relations. We will update this database whenever the model changes.
The RE database is a TSV file with the following columns:
- ‘NameFrom’ - relation tail entity name;
- ‘IdFrom’ - relation tail NCBI/MESH id;
- ‘GroupFrom’ - relation tail group name (chemical, disease, gene);
- ‘NameTo’ - relation head entity name;
- ‘IdTo’ - relation head NCBI/MESH id;
- ‘GroupTo’ - relation head group name (chemical, disease, gene);
- ‘Relation’ - relation type (see below the list of relation types);
- ‘PMID’ - PMID of the paper, where the relation was found;
- ‘Prob’ - model confidence in the extracted relation.
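A minimal sketch of reading such a file in Python, using only the standard library. It assumes the file starts with a header row that uses exactly the column names listed above; the sample rows and identifiers below are made up for illustration and are not real database content.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Column names as listed above; the presence of a header row is an assumption.
COLUMNS = ["NameFrom", "IdFrom", "GroupFrom", "NameTo", "IdTo",
           "GroupTo", "Relation", "PMID", "Prob"]


def read_relations(lines: Iterable[str], min_prob: float = 0.0) -> Iterator[Dict]:
    """Yield rows of the relations TSV whose 'Prob' is at least min_prob."""
    # QUOTE_NONE keeps quote characters inside entity names untouched.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["Prob"] = float(row["Prob"])
        if row["Prob"] >= min_prob:
            yield row


# Made-up sample content with placeholder names and ids.
SAMPLE = (
    "\t".join(COLUMNS) + "\n"
    "chemical A\tMESH:X1\tchemical\tgene B\t1234\tgene\texpression\t111\t0.95\n"
    "chemical A\tMESH:X1\tchemical\tdisease C\tMESH:X2\tdisease\ttherapeutic\t222\t0.40\n"
)

confident = list(read_relations(io.StringIO(SAMPLE), min_prob=0.9))
```

For the real file, the same function can be given an open file handle instead of the in-memory sample.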
We analyze the following classes of relation types:
These types represent a subset of types mentioned in the CTD database.
Release of June 18, 2020
- Improved performance of relation extraction with BioBERT.
- Added calibration of relation probabilities using Isotonic Regression. The calibrated probability is stored in an extra column, 'cProb'. A calibrated probability of 0.9 now means that about 9 out of 10 such relations are expected to be correct.
- Changed the relation types to the highest CTD parent:
  - CHEMICAL-DISEASE: therapeutic, marker/mechanism;
  - GENE-DISEASE: therapeutic, marker/mechanism;
  - CHEMICAL-GENE: expression, reaction, metabolic processing, activity, binding, response to substance, cotreatment, transport, therapeutic, localization.
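To illustrate the calibration step mentioned in this release, here is a minimal pure-Python sketch of isotonic regression using the pool-adjacent-violators algorithm. It is not the code used to produce 'cProb': the function names, the toy scores and labels, and the choice of a step function (library implementations often interpolate between points) are all assumptions made for the example.

```python
from bisect import bisect_right


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function to (score, label) pairs.

    Returns (thresholds, values); values[i] applies to scores >= thresholds[i].
    """
    # Pool equal scores first so that ties share a single value.
    grouped = {}
    for score, label in zip(scores, labels):
        total, count = grouped.get(score, (0.0, 0))
        grouped[score] = (total + label, count + 1)

    blocks = []  # each block: [lowest score, sum of labels, number of points]
    for score in sorted(grouped):
        total, count = grouped[score]
        blocks.append([score, total, count])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]
        ):
            _, merged_total, merged_count = blocks.pop()
            blocks[-1][1] += merged_total
            blocks[-1][2] += merged_count

    thresholds = [block[0] for block in blocks]
    values = [block[1] / block[2] for block in blocks]
    return thresholds, values


def calibrate(prob, thresholds, values):
    """Map a raw probability to its calibrated value (clamped at the low end)."""
    index = bisect_right(thresholds, prob) - 1
    return values[max(index, 0)]


# Toy data: raw model scores and whether each relation was actually correct.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
is_correct = [0, 1, 0, 1, 1, 1]

thresholds, values = fit_isotonic(raw_scores, is_correct)
calibrated = calibrate(0.3, thresholds, values)
```

In the toy data, the scores 0.2 and 0.3 are pooled into one block because their labels violate monotonicity, so both map to the same calibrated value.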
The database can be downloaded here.
Release of April 16, 2020
Release description: the first public release of the database. It can be downloaded here.