Following on from Academic Search Engines: A Quantitative Outlook, a book that explored the most relevant academic search engines from a quantitative point of view, this post reviews and analyses a new search engine aimed at the scholarly community. Semantic Scholar is a new web-based search product launched in 2015. For now it only covers research papers in Computer Science, but its developers expect to index papers from additional scientific areas over the next few years. It is developed by the Allen Institute for Artificial Intelligence, a research institute specialized in solutions based on artificial intelligence.

In this sense, Semantic Scholar introduces several original features that enhance the search of scientific literature. For example, it distinguishes different types of citations. By analysing the context in which a citation appears in the text, Semantic Scholar identifies how important certain papers have been for other articles. It is therefore possible to value research papers by the number of times they have influenced other works. Another important novelty is the extraction of tables and figures, which describe the content of each article in detail and make the results easier to explore. In addition, Semantic Scholar introduces a new bibliometric indicator, Citation Velocity, which shows the average number of citations that a paper has received in the last three years. These innovations improve the ranking of results and make it possible to select quality papers.
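
As a rough illustration of Citation Velocity, the sketch below computes the average number of citations per year over the last three years. Only the description above comes from the service; the function name citation_velocity, the inclusive three-year window and the input format (a list with the year of each citing paper) are assumptions made for this example, not Semantic Scholar's actual formula.

```python
def citation_velocity(citation_years, current_year, window=3):
    """Average citations per year over the last `window` years.

    Assumption: the window is inclusive and ends at `current_year`,
    e.g. 2013-2015 when current_year is 2015 and window is 3.
    """
    start = current_year - window + 1
    # Count only the citations that fall inside the recent window.
    recent = sum(1 for year in citation_years if start <= year <= current_year)
    return recent / window


# Hypothetical paper cited once in 2010 and four times between 2013 and 2015.
velocity = citation_velocity([2010, 2013, 2014, 2014, 2015], 2015)
print(round(velocity, 2))
```

With this definition, an old paper whose citations all predate the window has a velocity of zero, however many citations it has accumulated in total.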

The search structure of Semantic Scholar is, however, quite simple, and from this point of view it introduces nothing new. Moreover, its retrieval mechanism is poor and insufficient because it only employs exact matching. It is not possible to build complex queries because it accepts neither truncation, shortcuts, Boolean operators nor any other type of query syntax. The problem is made worse by the lack of a help page or an advanced search service. This makes it impossible to design specific and complex queries that retrieve precise or exhaustive results.

Semantic Scholar tries to make up for this limitation with a filtering system, which only makes it possible to refine the results. Six filters are employed to reduce the number of results:

Overviews: this check box singles out overviews or review articles that broadly describe the latest advances in a discipline. Thus, it is possible to retrieve seminal works when we need a general picture of a discipline. There is no explanation of what Semantic Scholar considers an overview, or at least of which elements are used to classify this type of article. But I expect that the procedure goes beyond identifying words such as survey, overview or review in the title of the paper.
Publication year: this filter presents a bar chart with the publication years of the documents retrieved. It is possible to select a range of years or a single year. The earliest year is 1975, so older papers are not included in the search engine.
Data set used: this is a useful filter for machine learning studies because it selects works that have employed particular data sets. The most important data sets that this search engine covers are LIBSVM Data: Classification, Regression, and Multi-label; UCI Machine Learning Repository; The PASCAL Visual Object Classes (VOC); and Aima-data, from the book Artificial Intelligence: A Modern Approach.
Author: Semantic Scholar extracts the authors of the publications and produces an author index through which the papers authored by one person can be selected. Searching for myself, I see that the system uses my full name (i.e. José Luis Ortega Priego), when I usually sign as José Luis Ortega. The only place that uses my full name is DBLP (Digital Bibliography & Library Project) from the University of Trier. Therefore, it is possible that this search engine is using at least the author index of this database. Moreover, Figure 1 shows that it not only employs this index, but that many of the records also come from this database. In an open world, where data sets from several services are freely used by others (e.g. mashups), this situation is not strange. However, in my view, it is not ethical to use data without a clear mention of the origin of the sources. I think this is an important issue because it creates confusion about which developments are truly original and which are borrowed.

Figure 1. The same author index in Semantic Scholar and DBLP

Publication venue: this is a very important filter because it selects publications by source. For a scholarly user, this is essential because it shows where the papers have been published. The most frequent source is Arxiv.org (4%), the main repository for Physics, Computer Science and related disciplines. However, the main drawback is that most of the papers (approx. 90%) do not show their publication venue, so it is impossible to know whether they have been published in a journal, conference or book, or whether they are simply unpublished materials.
Key Phrase: this is the last filter and it lists the most frequent word strings that identify the content of the papers. They are extracted by the system and they are different from the author keywords.

Semantic Scholar employs several criteria to rank the results, which are combined in different ways and with different weights. At first glance, the most relevant criterion is the number of citations received, even more so if the article has a high citation velocity. Next, the position of the query words in the text is also a ranking criterion: title, authors and abstract are the most important places, in that order. However, Semantic Scholar does not offer any alternative method to sort the results, such as by age or alphabetical order. One of the problems of giving too much weight to citations when ranking results is that the top positions can be occupied by old seminal papers with many citations that are irrelevant to the query. This problem, frequent in Google Scholar, is solved by Semantic Scholar with citation velocity, because it prioritizes papers with a high citation rate only in recent years. However, this criterion is not entirely suitable either, because it does not prevent recent papers that have not yet been cited from being placed in the last positions.
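
The ranking behaviour described above can be sketched as a toy scoring function. This is not Semantic Scholar's algorithm, which is not documented: the Paper class, the field weights (title > authors > abstract), the logarithmic damping and the w_cit and w_vel parameters are all hypothetical choices that merely reproduce the observed ordering.

```python
import math
from dataclasses import dataclass

# Hypothetical weights: title counts more than authors, authors more than abstract.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "abstract": 1.0}


@dataclass
class Paper:
    title: str
    authors: str
    abstract: str
    citations: int    # total citations received
    velocity: float   # average citations per year in the last three years


def field_score(paper, query):
    """Sum the weights of the fields that contain the exact query string."""
    q = query.lower()
    return sum(w for name, w in FIELD_WEIGHTS.items()
               if q in getattr(paper, name).lower())


def score(paper, query, w_cit=1.0, w_vel=2.0):
    """Combine field match, total citations and citation velocity."""
    match = field_score(paper, query)
    if match == 0:
        return 0.0  # exact matching only: no match, no result
    return (match
            + w_cit * math.log1p(paper.citations)
            + w_vel * math.log1p(paper.velocity))


def rank(papers, query):
    """Return matching papers sorted by descending score."""
    hits = [p for p in papers if score(p, query) > 0]
    return sorted(hits, key=lambda p: score(p, query), reverse=True)


papers = [
    Paper("Web crawling basics", "A. Author", "", citations=1000, velocity=0.0),
    Paper("Web crawling at scale", "B. Author", "", citations=200, velocity=60.0),
    Paper("Web crawling for scholars", "C. Author", "", citations=0, velocity=0.0),
]
for p in rank(papers, "web crawling"):
    print(p.title)
```

In this example the recently cited paper overtakes the old, highly cited one, while the new paper without citations still ends up last, which is the limitation noted above.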

Lastly, Semantic Scholar also has a web crawler that extracts full-text copies of these records from the public web. The site does not provide any information about the sources that supply documents to its database, so it is very hard to know how many documents come from repositories (e.g. Arxiv.org), document sharing platforms (e.g. ResearchGate or Academia.edu) or any other web site. These full-text documents are stored in Amazon Web Services (AWS) and they could account for around 70% of the total number of records. Like CiteSeerX, Semantic Scholar also acts as a repository, warehousing documents from different sites and platforms. These documents are mainly used to extract citations.

In summary, Semantic Scholar is an interesting solution that employs new developments in artificial intelligence to enhance the search of academic materials. The technical contributions of this engine are important because they make it possible to detect different types of citations, to automatically extract tables, figures and key phrases, and to distinguish papers by their content (e.g. Overviews). The proposal of a new bibliometric indicator, Citation Velocity, improves the ranking of documents and introduces a new way to evaluate articles by their recent impact. However, the principal criticism of this engine concerns its conceptual approach. As happens with many current academic search engines, Semantic Scholar overlooks some critical elements necessary for retrieving scientific documents and applies a generalist mechanism that does not entirely satisfy the demands of the scholarly community. For example, most of the papers lack a publication source. The source of a document may be irrelevant in a generalist search engine, but for a scholarly engine this information is fundamental to identifying and valuing the retrieved documents. An unpublished paper hosted in a repository is not the same as an article published in a peer-reviewed journal. In the case of Semantic Scholar, only a small fraction of records include that information and the remaining ones do not even present a link to their origin.

Another important problem is the absence of a robust search system that makes it possible to build advanced queries. It is true that advanced search is rarely used in generalist search engines, but in an academic one it is essential. Academic users rarely make transactional queries; they need to track in detail the entire literature in their research area and detect relevant papers in a precise field. Thus, academic users need a flexible system that allows them to include or exclude search terms (Boolean operators) and to look into specific fields (shortcuts or field search). The filtering system that Semantic Scholar employs is very positive, but it is not enough to satisfy researchers' needs.
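
To make concrete what such a flexible system could look like, here is a minimal sketch of a query evaluator with exclusion, field search and truncation. The syntax (a leading - to exclude a term, field:term to search one field, a trailing * for truncation, implicit AND between terms) and the function names are hypothetical; they are not features of Semantic Scholar or of any particular engine.

```python
import re


def term_matches(term, text):
    """True if `term` occurs as a whole word in `text` (case-insensitive).

    A trailing * truncates: 'search*' matches 'search', 'searching', etc.
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches(record, query):
    """Evaluate a space-separated query against a dict of text fields.

    All terms are combined with AND; a leading '-' negates a term (NOT).
    """
    for token in query.split():
        negate = token.startswith("-")
        if negate:
            token = token[1:]
        field, sep, term = token.partition(":")
        if sep and field in record:
            text = str(record[field])          # field search
        else:
            term = token                       # no known field: search everything
            text = " ".join(str(v) for v in record.values())
        found = term_matches(term, text)
        if found == negate:                    # required term missing, or excluded term present
            return False
    return True


record = {
    "title": "Academic search engines: a quantitative outlook",
    "authors": "José Luis Ortega",
}
print(matches(record, "title:search* -google"))   # truncation plus exclusion
print(matches(record, "authors:ortega title:engine"))  # 'engine' is not a whole word here
```

Even a small syntax like this lets a user widen a query with truncation and narrow it with exclusions and fields, which filters applied after an exact-match search cannot do.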

In any case, this service has just started and many of its developments are still in beta. I hope that these critical comments become a starting point for discussion about the current panorama of academic search engines and their future developments.

Monday, 16 November 2015

Semantic Scholar
