Publications analysis service
The number of papers published each year is growing steadily, so it is becoming infeasible for a single person to keep track of all the publications in their field of interest. Review papers solve this problem to some extent, but they cannot cover all the recent releases and focus only on those with significant impact. The demand for tools and methods that give a bird's-eye view of a scientific area, covering all recent works, is growing.
Pubtrends is a new service for analyzing scientific publications, available at http://bit.ly/pubtrends. It is an exploratory tool for researchers that speeds up trend analysis and the discovery of breakthrough papers among the steadily growing flow of papers worldwide. The service aims to solve three tasks: give a brief overview of the field, explore popular trends in publications, and help to find new promising directions. At the moment, the service incorporates the Pubmed database and the Semantic Scholar archive. Semantic Scholar* aggregates major journals and publishers, including Springer Nature, ACM, etc. Together, the two contain 200 million papers and 800 million references.
*Semantic Scholar is currently disabled.
Let's imagine that we are writing a review on human aging. First of all, type "human aging" into the search field. Wrapping the query in double quotes searches for the exact phrase; without quotes, the service finds documents in the Pubmed database that contain all the words in the query. The number of papers can be quite significant, so it's natural to use some ranking and filtering to focus on either the most cited, most recent, or most relevant articles. A threshold can be configured from the main page.
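The two query modes can be illustrated with a minimal Python sketch. This is not the Pubtrends implementation: the function names parse_query and matches are hypothetical, and matching on lowercased word tokens is an assumption made for the example.

```python
import re


def parse_query(query: str):
    """Return ("phrase", words) for a double-quoted query, else ("all_words", words)."""
    query = query.strip()
    quoted = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    words = re.findall(r"\w+", query.lower())
    return ("phrase" if quoted else "all_words"), words


def matches(query: str, text: str) -> bool:
    """Check a title or abstract against the query (hypothetical helper)."""
    mode, words = parse_query(query)
    tokens = re.findall(r"\w+", text.lower())
    if mode == "phrase":
        # The words must appear next to each other, in the same order.
        n = len(words)
        return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
    # Otherwise every word must appear somewhere in the text.
    return all(word in tokens for word in words)
```

With this sketch, the quoted query "human aging" matches a title such as "Hallmarks of human aging" but not "Aging in human cells", while the unquoted query matches both.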
We are interested in the most cited papers. The tool looks for documents in the local copy of the Pubmed database, ranks them by number of citations, and picks the top ones for further analysis. The most recent option is self-explanatory, while the most relevant option returns the articles in which the search query occurs most frequently. The web application is designed to support multiple simultaneous analyses, so clicking the Search button adds the user's job to a queue.
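The ranking options described above can be sketched as follows. The Paper record, the option names, and the default limit are all assumptions made for illustration; the real service reads from its database rather than from a Python list.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; field names are illustrative only.
    title: str
    abstract: str
    year: int
    citations: int


def query_frequency(paper: Paper, query: str) -> int:
    """Count how often the query words occur in the title and abstract."""
    text = f"{paper.title} {paper.abstract}".lower()
    return sum(text.count(word) for word in query.lower().split())


def select_papers(papers, query, sort="most_cited", limit=1000):
    """Rank papers by the chosen criterion and keep the top `limit`."""
    keys = {
        "most_cited": lambda p: p.citations,
        "most_recent": lambda p: p.year,
        "most_relevant": lambda p: query_frequency(p, query),
    }
    return sorted(papers, key=keys[sort], reverse=True)[:limit]
```

Calling select_papers(papers, "human aging", sort="most_cited", limit=5000) would correspond to the scenario in this walkthrough.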
The analysis process consists of several steps: searching for documents, ranking and filtering, collecting citation statistics, building citation and co-citation graphs, extracting subtopics, and finding popular journals and authors. Processing can take some time, so please be patient at this step: the service is under active development.
Once processing is finished, the browser is redirected to a result page. This page contains all the analytics and consists of several parts: Overview, Highlights, Topics, Trends, Publications, Authors, and Journals.
The Overview gives a bird's-eye view of the field, including the total number of articles, citations, and extracted topics. A word cloud shows the most frequent words in titles and abstracts, and a summary plot shows the number of papers per year. Please note that the word cloud is clickable, so you can navigate to the documents containing the selected word. Articles can also be viewed as a plain list.
Here we can see that the 5,000 most cited papers were analyzed, with more than 200 thousand citations, and eight separate subtopics were detected.
The Highlights section contains an interactive visualization of top-cited papers, organized by number and citation count. Different types of articles are shown in different colors.
Two other plots show the top paper of each year and the paper with the quickest growth in each year.
All the papers in the top-cited graphs are clickable, and we can explore the details of each on a separate page. The figure shows the information for the paper "The hallmarks of aging". First, the essential information is given: authors, paper title, journal, and abstract. It is followed by a plot of citation dynamics and an advanced analysis, including the most significant citing and cited papers and connections to other subtopics. Valuable references and citations are computed using the PageRank algorithm for document ranking, initially developed for web pages by Google co-founders Larry Page and Sergey Brin.
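To show how PageRank works on a citation graph, here is a minimal power-iteration sketch. It is a textbook formulation, not the code used by Pubtrends; the damping factor of 0.85 and the fixed number of iterations are conventional choices assumed here.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Rank papers in a citation graph.

    `citations` maps each paper id to the list of paper ids it cites.
    A paper ranks highly when it is cited by other highly ranked papers.
    """
    nodes = set(citations) | {c for cited in citations.values() for c in cited}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[node] for node in nodes if not citations.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, cited in citations.items():
            for target in cited:
                # Each paper splits its rank equally among its references.
                new_rank[target] += damping * rank[source] / len(cited)
        rank = new_rank
    return rank
```

For example, pagerank({"a": ["c"], "b": ["c"], "c": []}) gives paper "c" the highest score, since both other papers cite it.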
The next sections of the results page, except Authors and Journals, are dedicated to subtopic analysis. Subtopics are groups of closely related documents. Bibliometric methods are used to extract topics from a pool of articles: the co-citation graph is clustered with the Louvain algorithm for community detection. Please note that small subtopics are merged if they contain fewer than five percent of all documents. You can explore the sizes of subtopics and the relationships between them in this section.
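Two parts of this step are easy to sketch in plain Python: computing co-citation weights and merging small subtopics. The clustering itself is omitted here; a library implementation of Louvain (for instance louvain_communities in NetworkX) could be applied to the weighted graph. The function names below are hypothetical, and merging all small subtopics into one "other" group is an assumption, since the text does not say where they are merged.

```python
from collections import Counter
from itertools import combinations


def cocitation_weights(references):
    """`references` maps each citing paper to the papers it cites.

    Two papers are co-cited once for every paper that cites both,
    so the count serves as the edge weight in the co-citation graph.
    """
    weights = Counter()
    for cited in references.values():
        for a, b in combinations(sorted(set(cited)), 2):
            weights[(a, b)] += 1
    return weights


def merge_small_topics(labels, min_fraction=0.05, other="other"):
    """`labels` maps paper id -> topic label from a clustering step.

    Topics holding fewer than `min_fraction` of all papers are merged
    into a single `other` group (an assumption of this sketch).
    """
    sizes = Counter(labels.values())
    total = len(labels)
    small = {topic for topic, size in sizes.items() if size < min_fraction * total}
    return {paper: (other if topic in small else topic) for paper, topic in labels.items()}
```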
The Trends section embodies the idea that gave birth to Pubtrends.
It shows how trends evolve over time. Explore retrospective and fast-growing directions on the plot below.
Finally, it's time to look at all the publications! For each subtopic, the application shows the already familiar word cloud and articles plot. The word cloud is built from terms that are specific to the given topic relative to the others. These words are computed using TF-IDF normalization, a standard approach in natural language processing. The more important a word is, the larger the fraction of the topic's papers that contain it. You may think of TF-IDF as an analogue of Z-score normalization for NLP.
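A minimal TF-IDF sketch shows why topic-specific words stand out. It treats each topic as one document, which is an assumption made for this example; the function name topic_tfidf and the whitespace tokenization are likewise illustrative rather than taken from Pubtrends.

```python
import math
from collections import Counter


def topic_tfidf(topic_texts):
    """`topic_texts` maps a topic name to a list of its titles or abstracts.

    Returns, for each topic, a word -> score mapping where
    score = (share of the topic's words) * log(topics / topics containing the word).
    """
    counts = {
        topic: Counter(" ".join(texts).lower().split())
        for topic, texts in topic_texts.items()
    }
    n_topics = len(counts)
    # Number of topics in which each word appears at least once.
    topic_freq = Counter(word for c in counts.values() for word in c)
    scores = {}
    for topic, c in counts.items():
        total = sum(c.values())
        scores[topic] = {
            word: (n / total) * math.log(n_topics / topic_freq[word])
            for word, n in c.items()
        }
    return scores
```

A word such as "aging" that occurs in every topic gets a score of zero, while a word frequent in only one topic, such as "methylation", scores highest there.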
In our group, we focus on epigenetic changes inherently bound to cellular aging and development processes, so we are particularly interested in subtopic eight: "methylation", "epigenetic", etc. The service provides two options: view all articles (Show as list) or perform a similar analysis for the selected subtopic (Zoom into).
The last two sections list the most famous authors and the most relevant journals.
Quite a few ideas have been implemented and are already available in the application. More advanced trend analysis, paper impact prediction, and semi-automated review creation are planned for the near future.
The Pubtrends application is being developed by Oleg Shpynov (JetBrains Research) together with Nikolai Kapralov and Anna Vlasova (Computer Science Center).
Please send your questions or comments to oleg[dot]shpynov[at]gmail[dot]com.