Monday, 16 November 2015

Semantic Scholar



Following on from Academic Search Engines: A Quantitative Outlook, a book that explored the most relevant academic search engines from a quantitative point of view, this post reviews and analyses a new search engine aimed at the scholarly community. Semantic Scholar is a new web-based search product launched in 2015. For now it only covers research papers in Computer Science, but its developers expect to index papers from additional scientific areas over the next few years. It is developed by the Allen Institute for Artificial Intelligence, a non-profit research institute focused on artificial intelligence.
Semantic Scholar introduces several original features that enhance the search of scientific literature. For example, it distinguishes between different types of citations. By analysing the context in which a citation appears in the text, Semantic Scholar identifies how important certain papers have been for other articles. It therefore makes it possible to assess research papers by the number of times they have influenced other works. Another important novelty is the extraction of tables and figures, which describe the content of each article in detail and make the results easier to explore. In addition, Semantic Scholar introduces a new bibliometric indicator, Citation Velocity, which gives the average number of citations that a paper has received in the last three years. These innovations improve the ranking of results and make it possible to select quality papers.
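As a rough illustration of the Citation Velocity idea, the sketch below averages yearly citation counts over a three-year window. The post only says the indicator is an average over the last three years, so the function name, the input format and the choice to exclude the current (incomplete) year are all assumptions, and the counts are made up.

```python
def citation_velocity(citations_by_year, current_year, window=3):
    """Average yearly citations over the `window` full years before `current_year`.

    Assumption: the window excludes the current (incomplete) year; the post
    does not say how Semantic Scholar actually defines it.
    """
    years = range(current_year - window, current_year)
    total = sum(citations_by_year.get(year, 0) for year in years)
    return total / window


# Made-up citation counts for one paper, keyed by year.
counts = {2009: 120, 2012: 10, 2013: 20, 2014: 30}
velocity = citation_velocity(counts, current_year=2015)
print(velocity)  # 20.0
```

Note that the 120 citations from 2009 do not count at all, which is what lets an indicator of this kind favour recent impact over accumulated impact.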
The search structure of Semantic Scholar is, however, quite simple, and from this point of view it introduces nothing new. Moreover, its retrieval mechanism is poor and insufficient because it only uses exact matching. It does not allow any complex query to be built, because it accepts neither truncation, shortcuts, Boolean operators nor any other type of query syntax. The problem is made worse by the absence of both a help page and an advanced search service. As a result, it is impossible to design specific and complex queries that retrieve precise or exhaustive results.
Semantic Scholar tries to make up for this limitation with a filtering system that only allows the results to be refined. Six filters are available to reduce the number of results:

  • Overviews: this checkbox picks out overviews or review articles that broadly describe the latest advances in a discipline. It makes it possible to retrieve seminal works when we need a general picture of a discipline. There is no explanation of what Semantic Scholar considers an overview, or at least of which elements are used to classify this type of article. I expect, however, that the procedure goes beyond identifying words such as survey, overview or review in the title of the paper.
  • Publication year: this filter shows a bar chart with the publication years of the documents retrieved. It is possible to select a range of years or a single year. The earliest year is 1975, so older papers are not included in the search engine.
  • Data set used: this is a useful filter for machine learning studies because it selects works that have used particular data sets. The most important data sets covered by this search engine are LIBSVM Data: Classification, Regression, and Multi-label; the UCI Machine Learning Repository; The PASCAL Visual Object Classes (VOC); and Aima-data, from the book Artificial Intelligence: A Modern Approach.
  • Author: Semantic Scholar extracts the authors of the publications and builds an author index from which the papers written by one person can be selected. Looking myself up, I see that the system uses my full name (i.e. José Luis Ortega Priego), when I usually sign as José Luis Ortega. The only place that uses my full name is DBLP (Digital Bibliography & Library Project) from the University of Trier, so it is possible that this search engine is using at least the author index of that database. Figure 1 shows, however, that it takes not only this index but also many of its records from DBLP. In an open world, where data sets from one service are freely reused by others (i.e. mashups), this situation is not strange. Nevertheless, in my view it is not ethical to use data without clearly acknowledging the source. I think this is an important issue, because it blurs which developments are truly original and which are borrowed.


Figure 1. The same author index in Semantic Scholar and DBLP

  • Publication venue: this is a very important filter because it selects publications by source. For a scholarly user it is essential to know where papers have been published. The most frequent source is Arxiv.org (4%), the main repository for Physics, Computer Science and related disciplines. The main drawback, however, is that most papers (approximately 90%) do not show their publication venue, so it is impossible to know whether they have been published in a journal, conference or book, or are simply unpublished material.
  • Key Phrase: this is the last filter. It lists the most frequent word chains that identify the content of the papers. These are extracted by the system and are different from the author keywords.

Semantic Scholar uses several criteria to rank the results, combined in different ways and with different weights. At first glance, the most relevant criterion is the number of citations received, all the more so if the article has a high citation velocity. The position of the query words in the record is a further criterion: title, authors and abstract are the most important places, in that order. However, Semantic Scholar does not offer any alternative way to sort the results, such as by age or in alphabetical order. One problem with giving too much weight to citations when ranking results is that the top positions can be occupied by old seminal papers that have many citations but are irrelevant to the query. Semantic Scholar addresses this problem, which is frequent in Google Scholar, with citation velocity, because it prioritizes papers with a high citation rate in recent years only. This criterion is still not entirely suitable, however, because it does not prevent very recent papers that have not yet received citations from being placed in the last positions.
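The interplay between these criteria can be illustrated with a toy scoring function. This is only a sketch and not Semantic Scholar's algorithm: the field weights, the logarithmic damping, the dictionary keys and the sample papers are all hypothetical, chosen simply to respect the order title, authors, abstract and to combine citations with citation velocity.

```python
import math

# Hypothetical weights: only the order title > authors > abstract is observed.
FIELD_WEIGHTS = {"title": 3.0, "authors": 2.0, "abstract": 1.0}


def rank_score(paper, query, citation_weight=1.0, velocity_weight=1.0):
    """Toy relevance score: exact-match field hits plus damped citation signals."""
    needle = query.lower()
    match = sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if needle in paper.get(field, "").lower()
    )
    if match == 0:
        return 0.0  # the query does not appear anywhere: not retrieved
    return (
        match
        + citation_weight * math.log1p(paper.get("citations", 0))
        + velocity_weight * math.log1p(paper.get("velocity", 0.0))
    )


# Made-up records: an old seminal paper, a recent well-cited one, a brand-new one.
old_seminal = {"title": "Neural networks", "citations": 5000, "velocity": 2.0}
recent = {"title": "Deep neural networks", "citations": 300, "velocity": 150.0}
uncited = {"title": "Neural networks revisited", "citations": 0, "velocity": 0.0}

papers = [old_seminal, recent, uncited]
ranked = sorted(papers, key=lambda p: rank_score(p, "neural networks"), reverse=True)
print([p["title"] for p in ranked])
```

With these made-up numbers the recent paper overtakes the old seminal one thanks to its velocity, while the uncited paper still ends up last, which is the weakness of a citation-driven ranking.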
Lastly, Semantic Scholar also has a web crawler that extracts full-text copies of these records from the public web. The site does not provide any information about the sources that supply documents to its database, so it is very hard to know how many documents come from repositories (e.g. Arxiv.org), document sharing platforms (e.g. ResearchGate or Academia.edu) or any other web site. These full-text documents are stored in Amazon Web Services (AWS) and could account for around 70% of the total number of records. Like CiteSeerX, Semantic Scholar also acts as a repository, warehousing documents from different sites and platforms. These documents are mainly used to extract the citations.
In summary, Semantic Scholar is an interesting solution that uses new developments in artificial intelligence to enhance the search for academic materials. The technical contributions of this engine are important because they make it possible to detect different types of citations, to extract tables, figures and key phrases automatically, and to distinguish papers by their content (e.g. Overviews). The proposal of a new bibliometric indicator, Citation Velocity, improves the ranking of documents and introduces a new way to evaluate articles by their recent impact. However, the main criticism of this engine concerns its conceptual approach. As happens with many current academic search engines, Semantic Scholar overlooks some elements that are critical for retrieving scientific documents, and it applies a generalist mechanism that does not entirely meet the demands of the scholarly community. For example, most of the papers lack a publication source. The source of a document may be irrelevant in a generalist search engine, but for a scholarly engine this information is fundamental to identifying and assessing the retrieved documents. An unpublished paper hosted in a repository is not the same as an article published in a peer-reviewed journal. In Semantic Scholar, only a small fraction of records include that information, and the remaining ones do not even link to their origin.
Another important problem is the absence of a robust search system that makes it possible to build advanced queries. It is true that advanced search is rarely used in generalist search engines, but in an academic one it is essential. Academic users rarely make transactional queries; they need to track the entire literature in their research area in detail and detect relevant papers in a precise field. They therefore need a flexible system that allows search terms to be included or excluded (Boolean operators) and specific fields to be searched (shortcuts or field search). The filtering system that Semantic Scholar provides is very positive, but it is not enough to satisfy researchers' needs.
In any case, this service has just started and many of its developments are still in beta. I just hope that these critical comments become a starting point for discussion about the current panorama of academic search engines and their future development.


Thursday, 12 February 2015

Google Scholar Citations 2015 report


I think that the best way to inaugurate a publication on academic web sites is to post a detailed analysis of one of the most relevant scholarly platforms. Google Scholar Citations (GSC) is a Google Scholar service that makes it easy to build a brief publication CV from documents indexed in Google Scholar. Since December 2011, I have gathered exhaustive samples of Google Scholar profiles every year, extracting the most complete list of profiles possible, with their identification data, interests, collaborators and bibliometric indicators (needless to say, those data are available for collaboration). The latest crawl was carried out between December 2014 (crawling) and January 2015 (harvesting). This post summarizes the information retrieved in that crawl as a current report on the coverage of Google Scholar Citations.
In this latest sample, 596,105 profiles were obtained, 101% more than in 2014 (296,205). This growth shows that Google Scholar Citations is in good health, since the number of profiles has doubled in just one year. Even so, the figures are still far from those of other successful scholarly sites such as Academia.edu (more than 1.2 million) or ResearchGate (5 million). At this growth rate, GSC could soon be comparable with these other social platforms. Although GSC cannot really be considered a social network, the way in which users join the service is comparable to that of an academic social network.
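The growth figure can be checked with a one-line calculation. The helper name below is made up for illustration; the two profile counts are the ones reported for the 2014 and 2015 samples.

```python
def growth_percent(previous, current):
    """Percentage change from `previous` to `current`."""
    return 100 * (current - previous) / previous


# Profile counts reported in the post for the 2014 and 2015 samples.
growth = growth_percent(296_205, 596_105)
print(round(growth, 1))  # 101.2
```

The same calculation applies to the growth percentages given for labels, organizations and countries in the rest of this report.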

Labels

Table 1. The ten most frequent labels

The labels that researchers include in their profiles have increased by 204% since January 2014. The predominant terms are keywords related to Computer and Information Sciences: the most frequent are machine learning (0.7%), artificial intelligence (0.5%) and bioinformatics (0.4%). Only neuroscience (0.3%), ecology (0.3%) and nanotechnology (0.2%) are linked to other disciplines. These results are similar to those observed in 2011 (Ortega and Aguillo, 2012), which tells us that the population of GSC has not changed much in thematic terms. However, the label whose presence has increased most is neuroscience (212%), which suggests that new members from disciplines other than Computer Science are joining GSC.

Organizations

Table 2. The ten organizations with most profiles

By organization, the universities with most profiles are Universidade de Sao Paulo (1.08%), Harvard University (0.6%) and Stanford University (0.47%). It is interesting that two Brazilian universities stand out among the first ten organizations, Universidade de Sao Paulo (1.08%) and Universidade Estadual Paulista (0.44%), which shows the strong presence of Brazilian profiles in GSC. Apart from these, the list is dominated by universities from the United States (6), the United Kingdom (1) and Canada (1). Compared with the 2014 sample, the organizations whose profiles have grown most are Stanford University (256%) and Harvard University (184%), while the University of Michigan (56%) and Universidade Estadual Paulista (52%) have grown least.
Turning to bibliometric indicators, the universities with the best citations-per-paper ratio are Harvard University (41.5 citations per document) and Stanford University (41.4 citations per document). In contrast, Universidade Estadual Paulista (6.6) and Universidade de Sao Paulo (10.2) have the lowest ratios.

Countries

Table 3. The ten countries with most profiles (*2010, **2008)

Figure 1. Map of profiles by country

Finally, the country distribution shows that the United States (19.8%) is the country with most profiles, far ahead of the United Kingdom (5.6%) and Brazil (4%). As in the table of organizations, the table of countries confirms the strong presence of Brazilian profiles, which climb to third position in GSC. Compared with 2014, the countries with the largest increases are India (807%), Spain (419%) and Germany (375%), while Brazil (134%) has slowed its growth, along with the United States (142%) and the United Kingdom (163%).
To normalize these percentages, I have developed an indicator that measures the degree of penetration of GSC in a country. It is the result of dividing the number of profiles by the total number of researchers in each country (UNESCO, 2015). The countries with the highest penetration are Australia (21%), Brazil (16.5%) and Italy (16.2%), whereas GSC is least successful in Germany (4.3%) and France (5.5%). These figures must be treated with caution, however, because the researcher data for Australia are from 2008 and those for Brazil and India from 2010. The actual penetration in these countries could therefore be lower.
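The penetration indicator is a simple ratio, sketched below. The function name is made up and the counts are hypothetical round numbers, since the underlying profile and researcher totals per country are not given in this post.

```python
def penetration(profiles, researchers):
    """GSC profiles as a percentage of a country's researchers."""
    return 100 * profiles / researchers


# Hypothetical counts for one country, for illustration only.
example = penetration(profiles=21_000, researchers=100_000)
print(example)  # 21.0
```

Because the denominator comes from a different year than the numerator for some countries, an outdated (smaller) researcher count inflates the result, which is why those values should be read as upper bounds.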
In summary, some conclusions can be drawn from these data:
  • GSC is still growing, doubling its population in a year. But its size is still far from that of other academic sites such as ResearchGate and Academia.edu.
  • The most frequent labels are related to Computer and Information Sciences, although terms from Medicine (neuroscience), Biology (ecology) and Materials Science (nanotechnology) are increasing quickly.
  • Universidade de Sao Paulo is still the organization with most profiles, although American universities such as Harvard University and Stanford University are again growing strongly.
  • The United States is the country with most profiles, and Indian profiles have increased strongly. Australia and Brazil are the countries where GSC is most popular, while Germany and France show less interest in this service.

References

Ortega, J. L., Aguillo, I. F. (2012). Science is all in the eye of the beholder: keyword maps in Google Scholar Citations. Journal of the American Society for Information Science and Technology, 63(12): 2370-2377

UNESCO Institute for Statistics (2015). http://data.uis.unesco.org

Please cite this post as: Ortega, J. L. (2015). Google Scholar Citations 2015 report. The Scientific Web Observer. http://swobserver.blogspot.com.es/2015/02/google-scholar-citation-2015-report.html