The Semantic Scholar Open Data Platform
- URL: http://arxiv.org/abs/2301.10140v1
- Date: Tue, 24 Jan 2023 17:13:08 GMT
- Title: The Semantic Scholar Open Data Platform
- Authors: Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy,
Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra,
Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason
Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David
Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey
Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey
MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul
Sayre, Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar
Subramanian, Amber Tanaka, Alex D. Wade, Linda Wagner, Lucy Lu Wang, Chris
Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen,
Daniel S. Weld
- Abstract summary: Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
- Score: 79.4493235243312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The volume of scientific output is creating an urgent need for automated
tools to help scientists keep up with developments in their field. Semantic
Scholar (S2) is an open data platform and website aimed at accelerating science
by helping scholars discover and understand scientific literature. We combine
public and proprietary data sources using state-of-the-art techniques for
scholarly PDF content extraction and automatic knowledge graph construction to
build the Semantic Scholar Academic Graph, the largest open scientific
literature graph to date, with 200M+ papers, 80M+ authors, 550M+
paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced
semantic features such as structurally parsed text, natural language summaries,
and vector embeddings. In this paper, we describe the components of the S2 data
processing pipeline and the associated APIs offered by the platform. We will
update this living document to reflect changes as we add new data offerings and
improve existing services.
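The graph described in the abstract has two node types (papers and authors) and two edge types (paper-authorship and citation). A minimal in-memory sketch of that structure might look like the following. Every name here (`ToyAcademicGraph`, its fields and methods, and the sample IDs) is hypothetical and chosen for illustration only; none of it reflects the actual S2 schema, pipeline, or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyAcademicGraph:
    """Toy graph with paper/author nodes; hypothetical, not the S2 schema."""

    papers: dict[str, str] = field(default_factory=dict)    # paper_id -> title
    authors: dict[str, str] = field(default_factory=dict)   # author_id -> name
    # Paper-authorship edges stored as (paper_id, author_id) pairs.
    authorships: set[tuple[str, str]] = field(default_factory=set)
    # Citation edges stored as (citing_paper_id, cited_paper_id) pairs.
    citations: set[tuple[str, str]] = field(default_factory=set)

    def add_author(self, author_id: str, name: str) -> None:
        self.authors[author_id] = name

    def add_paper(self, paper_id: str, title: str,
                  author_ids: tuple[str, ...] = ()) -> None:
        # Validate every author first so a failure leaves the graph unchanged.
        for author_id in author_ids:
            if author_id not in self.authors:
                raise KeyError(f"unknown author: {author_id}")
        self.papers[paper_id] = title
        for author_id in author_ids:
            self.authorships.add((paper_id, author_id))

    def add_citation(self, citing_id: str, cited_id: str) -> None:
        for paper_id in (citing_id, cited_id):
            if paper_id not in self.papers:
                raise KeyError(f"unknown paper: {paper_id}")
        self.citations.add((citing_id, cited_id))

    def cited_by(self, paper_id: str) -> list[str]:
        """Return the sorted IDs of papers that cite `paper_id`."""
        return sorted(citing for citing, cited in self.citations
                      if cited == paper_id)

    def papers_by(self, author_id: str) -> list[str]:
        """Return the sorted IDs of papers written by `author_id`."""
        return sorted(paper for paper, author in self.authorships
                      if author == author_id)


# Made-up sample data: two authors, two papers, one citation.
graph = ToyAcademicGraph()
graph.add_author("a1", "Author One")
graph.add_author("a2", "Author Two")
graph.add_paper("p1", "First paper", ("a1",))
graph.add_paper("p2", "Second paper", ("a1", "a2"))
graph.add_citation("p2", "p1")

print(graph.cited_by("p1"))   # ['p2']
print(graph.papers_by("a1"))  # ['p1', 'p2']
```

Storing edges as sets of ID pairs keeps the sketch short. A graph at the scale the abstract reports would need indexed, disk-backed storage rather than linear scans over edge sets.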
Related papers
- ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity [13.978222668670192]
ByteScience is a non-profit cloud-based auto fine-tuned Large Language Model (LLM) platform.
It is designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora.
The platform achieves remarkable accuracy with only a small amount of well-annotated articles.
arXiv Detail & Related papers (2024-11-18T19:36:26Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation [20.994565065595232]
We present a new corpus to facilitate the automated generation of scientific news reports.
Our dataset comprises academic publications and their corresponding scientific news reports across nine disciplines.
We benchmark our dataset employing state-of-the-art text generation models.
arXiv Detail & Related papers (2024-03-26T14:54:48Z)
- The Open Review-Based (ORB) dataset: Towards Automatic Assessment of Scientific Papers and Experiment Proposals in High-Energy Physics [0.0]
We introduce the new comprehensive Open Review-Based dataset (ORB).
It includes a curated list of more than 36,000 scientific papers, along with more than 89,000 reviews and final decisions.
This paper presents our data architecture and an overview of the collected data along with relevant statistics.
arXiv Detail & Related papers (2023-11-29T20:52:02Z)
- PubGraph: A Large-Scale Scientific Knowledge Graph [11.240833731512609]
PubGraph is a new resource for studying scientific progress that takes the form of a large-scale knowledge graph.
PubGraph is comprehensive and unifies data from various sources, including Wikidata, OpenAlex, and Semantic Scholar.
We create several large-scale benchmarks extracted from PubGraph for the core task of knowledge graph completion.
arXiv Detail & Related papers (2023-02-04T20:03:55Z)
- Citation Trajectory Prediction via Publication Influence Representation Using Temporal Knowledge Graph [52.07771598974385]
Existing approaches mainly rely on mining temporal and graph data from academic articles.
Our framework is composed of three modules: difference-preserved graph embedding, fine-grained influence representation, and learning-based trajectory calculation.
Experiments are conducted on both the APS academic dataset and our contributed AIPatent dataset.
arXiv Detail & Related papers (2022-10-02T07:43:26Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- PubSqueezer: A Text-Mining Web Tool to Transform Unstructured Documents into Structured Data [0.0]
I present a web tool which uses a Text Mining strategy to transform unstructured biomedical articles into structured data.
The generated results give a quick overview of complex topics and can possibly suggest information that is not explicitly reported.
I show how a literature-based analysis conducted with PubSqueezer results makes it possible to describe known facts about SARS-CoV-2.
arXiv Detail & Related papers (2020-11-05T22:23:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.