Publications analysis service
The number of papers published each year is growing steadily, so it has become infeasible for a single person to keep track of all the publications in their field of interest. Review papers solve this problem to some extent, but they cannot cover all recent releases and focus only on those with significant impact. The demand for tools and methods that give a bird's-eye view of a scientific area, covering all recent works, is growing.
Pubtrends is a new scientific publications analysis service. The service is available at http://bit.ly/pubtrends
It is an exploratory tool for researchers that speeds up trend analysis and the discovery of breakthrough papers among the steadily growing flow of papers worldwide. The service aims to solve three tasks: give a brief overview of the field, explore popular trends in publications, and help to find new promising directions. At the moment, the service incorporates the Pubmed database and the Semantic Scholar archive. Semantic Scholar* aggregates significant journals and publishers, including Springer Nature, ACM, etc. Together, the two contain 200 million papers and 800 million references.
*Semantic Scholar is disabled now.
Let's imagine that we are trying to write a review of human aging. First of all, type "human aging" into the search field. You can wrap the query in double quotes to search for the exact phrase, or leave it unquoted to find documents in the Pubmed database that contain all the words in the query. The number of papers can be quite significant, so it's natural to use some ranking and filtering to focus on the most cited, most recent, or most relevant articles. The threshold on the number of papers can be configured from the main page.
Click Take a tour to view a quick tutorial on usage.
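The two search modes can be illustrated with a minimal sketch. This is not the service's actual implementation: the function names parse_query and matches are hypothetical, and the sketch assumes that a quoted part must appear as an exact phrase while every unquoted word must appear somewhere in the text.

```python
import re


def parse_query(query):
    """Split a query into quoted exact phrases and remaining single words."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]*"', " ", query)
    words = [w.lower() for w in rest.split()]
    return phrases, words


def matches(text, query):
    """Return True if text contains every quoted phrase and every other word."""
    phrases, words = parse_query(query)
    lowered = text.lower()
    tokens = set(re.findall(r"\w+", lowered))
    return all(p in lowered for p in phrases) and all(w in tokens for w in words)
```

With this sketch, the unquoted query human aging matches the title "Aging in human cells", while the quoted query "human aging" does not, because the two words are not adjacent.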
We are interested in the most cited papers. The tool looks for documents in the local copy of the Pubmed database, ranks all of them by number of citations, and picks the top ones for further analysis. The most recent papers option is self-describing, while the most relevant option returns the articles in which the search query occurs most frequently. The web application is designed to support multiple simultaneous analyses, so clicking the Search button adds the user's job to a queue.
The analysis process consists of several steps: searching for documents, ranking and filtering, collecting citation statistics, building citation and co-citation graphs, extracting subtopics, and finding popular journals and authors. Processing can take some time, so please be patient at this step, since the service is under substantial development.
Once processing is finished, the browser is redirected to a result page. This page contains all the analytics and consists of several parts: Overview, Highlights, Topics, Trends, Publications, Authors, and Journals.
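The three ranking options (most cited, most recent, most relevant) can be sketched as follows. This is an illustration under assumptions, not the service's code: the Paper record and the select_papers function are hypothetical, and relevance is approximated here by counting how often the query words occur in the title.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int
    citations: int


def select_papers(papers, query, sort="most cited", limit=1000):
    """Rank papers by the chosen criterion and keep the top `limit` ones."""
    if sort == "most cited":
        key = lambda p: p.citations
    elif sort == "most recent":
        key = lambda p: p.year
    elif sort == "most relevant":
        # Assumption: relevance = number of query word occurrences in the title.
        words = query.lower().split()
        key = lambda p: sum(p.title.lower().count(w) for w in words)
    else:
        raise ValueError(f"unknown sort option: {sort}")
    return sorted(papers, key=key, reverse=True)[:limit]
```

The limit parameter plays the role of the configurable threshold on the number of papers passed on to further analysis.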
The Overview section gives a bird's-eye view of the field, including the total number of articles, citations, and extracted topics. A word cloud shows the most frequent words in titles and abstracts, and a summary plot shows the number of papers per year. Please note that the word cloud is clickable, so you can navigate to the documents containing the selected word. Articles can be viewed as a plain list as well.
Here we can see that the 1000 most cited papers were analyzed, with more than 150 thousand citations, and seven separate subtopics were detected.
You can always click the Show as list button to view the papers in a good old table, which you can search, sort, etc.
The Highlights section contains an interactive visualisation of top-cited papers, organised by number and citation count. Different types of articles are shown in different colours.
Two other plots show the top paper of each year and the paper with the quickest citation growth in each year.
All the papers in the top-cited graphs are clickable, and we can explore the details of each on a separate page. In the figure you can see the information for "The hallmarks of aging" paper. First, the essential information is given: authors, paper title, journal, and abstract. It is followed by a citation dynamics plot and advanced analysis, including the most significant citing and cited papers and connections to other subtopics. Valuable references and citations are computed using PageRank, a document ranking algorithm initially developed for web pages by Google co-founder Larry Page.
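To show how PageRank applies to a citation graph, here is a minimal power-iteration sketch. It is a textbook version, not necessarily the variant the service uses; the function name, the damping factor of 0.85, and the fixed number of iterations are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each paper to the list of papers it cites.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every paper receives the same small baseline score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A paper passes its score on to the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A paper without references spreads its score evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

A paper cited by many well-ranked papers accumulates a high score, which is why the ranking can surface valuable references and citations rather than merely frequent ones.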
The next sections of the results page, except Authors and Journals, are dedicated to subtopics analysis. You can explore the sizes and relationships between subtopics in this section.
Subtopics are closely related groups of documents. Bibliometric methods are used to extract topics from the pool of articles. Three citation-based methods are used to compute paper similarity: direct citations, co-citations, and bibliographic coupling. A community detection algorithm is then used to extract subtopics. Subtopics that contain fewer than five percent of all documents are merged. Not all papers can be assigned to a subtopic with citation-based methods alone. To overcome this limitation, we use text similarity to assign a topic to papers that fall outside the citation graph. The overall structure of a research field can be visualised as a graph. We use the hierarchical structure produced by the Louvain community detection algorithm and show similarities within each community, applying the Local Sparsification method to reduce the number of edges. We also show the most similar papers from different communities (topics).
Click this button to explore the structure of the research field.
This structure graph can be used for visual inspection of paper similarity and helps you find the most important spots.
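The three citation-based similarity measures can be sketched from a plain list of citation pairs. In the standard bibliometric definitions, two papers are co-cited when a third paper cites both of them, and two papers are bibliographically coupled when they share a reference. The function name and the data layout below are assumptions for illustration; how the service weights and combines the three measures is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def citation_similarities(citations):
    """Count direct citations, co-citations and bibliographic coupling.

    citations is an iterable of (citing, cited) pairs. Each result maps
    a sorted pair of papers (a, b) to a count.
    """
    refs = defaultdict(set)      # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for citing, cited in citations:
        if citing != cited:
            refs[citing].add(cited)
            cited_by[cited].add(citing)

    direct = defaultdict(int)
    for citing, cited_set in refs.items():
        for cited in cited_set:
            direct[tuple(sorted((citing, cited)))] += 1

    # Co-citation: number of papers that cite both a and b.
    cocitation = defaultdict(int)
    for cited_set in refs.values():
        for pair in combinations(sorted(cited_set), 2):
            cocitation[pair] += 1

    # Bibliographic coupling: number of references shared by a and b.
    coupling = defaultdict(int)
    for citing_set in cited_by.values():
        for pair in combinations(sorted(citing_set), 2):
            coupling[pair] += 1

    return direct, cocitation, coupling
```

The resulting pair counts can serve as edge weights of a paper similarity graph, which is the kind of input a community detection algorithm such as Louvain works on.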
Let's go back to the report page. The Topics section contains information on topic sizes, topic similarity, etc.
One important question is the quality of the separation into topics. To assess it, we show a heatmap of the mean similarity between papers, grouped by topic.
In the ideal case only the diagonal is highlighted, i.e. the similarity between papers within a single topic is much higher than with papers from other topics.
This is the heatmap in our case: you can see that the overall quality of topic extraction is quite good.
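The values behind such a heatmap can be computed with a short sketch like the one below. The function name and the input layout (a dictionary of pairwise paper similarities plus a paper-to-topic mapping) are assumptions; pairs missing from the similarity dictionary are treated as having zero similarity.

```python
from collections import defaultdict
from itertools import combinations


def mean_topic_similarity(similarity, topic_of):
    """Average pairwise paper similarity for every pair of topics.

    similarity maps (paper_a, paper_b) to a similarity value;
    topic_of maps each paper to its topic id.
    Returns {(topic_i, topic_j): mean similarity} with topic_i <= topic_j.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b in combinations(sorted(topic_of), 2):
        key = tuple(sorted((topic_of[a], topic_of[b])))
        # Look the pair up in either order; missing pairs count as 0.
        value = similarity.get((a, b), similarity.get((b, a), 0.0))
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

Entries with equal topic ids form the diagonal of the heatmap. A good separation shows diagonal values that are clearly larger than the off-diagonal ones.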
This barplot shows the size of each topic.
The Trends section is the idea that gave birth to Pubtrends. It shows how trends evolve over time. Explore retrospective and fast-growing directions on the plot below.
Finally, it's time to look at all the publications! For each subtopic, the application shows the now familiar word cloud and articles plot. The word cloud is built from terms specific to the given topic with respect to the others. These words are computed using TF-IDF normalization, a standard approach in the field of natural language processing. The more important a word is, the larger the fraction of papers in the topic that contain it.
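A minimal sketch of picking topic-specific terms with TF-IDF is given below. It assumes that each topic is treated as one large document, so a word that occurs in every topic gets a zero score; the function name topic_keywords and the simple tokenizer are hypothetical, and the service's actual text processing may differ.

```python
import math
import re
from collections import Counter


def topic_keywords(topic_texts, top_n=5):
    """Rank words per topic by TF-IDF, treating each topic as one document.

    topic_texts maps a topic id to a list of titles or abstracts.
    """
    counts = {
        topic: Counter(
            word for text in texts for word in re.findall(r"[a-z]+", text.lower())
        )
        for topic, texts in topic_texts.items()
    }
    n_topics = len(counts)
    # Document frequency: in how many topics does each word occur?
    df = Counter()
    for counter in counts.values():
        df.update(counter.keys())

    keywords = {}
    for topic, counter in counts.items():
        total = sum(counter.values())
        scores = {
            word: (n / total) * math.log(n_topics / df[word])
            for word, n in counter.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[topic] = ranked[:top_n]
    return keywords
```

Words shared by all topics, such as "aging" in a search on human aging, are pushed to the bottom of every list, so each word cloud highlights what distinguishes its topic from the rest.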
In our group, we are focusing on epigenetic changes inherently bound to cellular aging and development processes. So we are particularly interested in Subtopic eight - "methylation", "epigenetic", etc. The service provides two options: view all articles (Show as list) or perform a similar analysis for a selected subtopic (Zoom into).
The last two sections contain the most famous authors and most relevant journals info.
The project is being developed by Oleg Shpynov (JetBrains Research) together with students Nikolay Kapralov and Anna Vlasova as a Computer Science Center term project and JetBrains summer internship 2019.
At the moment, together with Anna Nikiforovskay (Higher School of Economics) and Alexei Shpilman (JetBrains Research), we are working on automated literature review generation for a given search query using deep learning approaches to natural language processing.
Quite a few ideas have already been implemented and are available in the application. More advanced trends analysis, paper impact prediction, and semi-automated review creation are planned for the near future. Stay tuned for updates.