SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software
Mentions in Scientific Articles
- URL: http://arxiv.org/abs/2108.09070v1
- Date: Fri, 20 Aug 2021 08:53:03 GMT
- Title: SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software
Mentions in Scientific Articles
- Authors: David Schindler, Felix Bensmann, Stefan Dietze and Frank Krüger
- Abstract summary: SoMeSci is a knowledge graph of software mentions in scientific articles.
It contains high quality annotations (IRR: $\kappa{=}.82$) of 3756 software mentions in 1367 PubMed Central articles.
- Score: 1.335443972283229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge about software used in scientific investigations is important for
several reasons, for instance, to enable an understanding of provenance and
methods involved in data handling. However, software is usually not formally
cited, but rather mentioned informally within the scholarly description of the
investigation, raising the need for automatic information extraction and
disambiguation. Given the lack of reliable ground truth data, we present
SoMeSci (Software Mentions in Science), a gold standard knowledge graph of
software mentions in scientific articles. It contains high quality annotations
(IRR: $\kappa{=}.82$) of 3756 software mentions in 1367 PubMed Central
articles. Besides the plain mention of the software, we also provide relation
labels for additional information, such as the version, the developer, a URL or
citations. Moreover, we distinguish between different types, such as
application, plugin or programming environment, as well as different types of
mentions, such as usage or creation. To the best of our knowledge, SoMeSci is
the most comprehensive corpus about software mentions in scientific articles,
providing training samples for Named Entity Recognition, Relation Extraction,
Entity Disambiguation, and Entity Linking. Finally, we sketch potential use
cases and provide baseline results.
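The annotation scheme described in the abstract (software types, mention types, and relation labels such as version, developer, URL, or citation) can be sketched as a simple record. The field names and example values below are illustrative assumptions, not SoMeSci's actual schema.

```python
# Illustrative sketch of a SoMeSci-style software mention annotation.
# All field names and values here are hypothetical; the actual SoMeSci
# schema and vocabulary may differ.

mention = {
    "article": "PMC1234567",         # hypothetical PubMed Central ID
    "surface_form": "SPSS",          # the software name as mentioned in text
    "software_type": "Application",  # vs. e.g. plugin, programming environment
    "mention_type": "Usage",         # vs. e.g. creation
    "relations": {                   # additional information per the abstract
        "Version": "25.0",
        "Developer": "IBM",
        "URL": "https://www.ibm.com/spss",
    },
}

def has_relation(m, label):
    """Check whether a mention annotation carries a given relation label."""
    return label in m.get("relations", {})

print(has_relation(mention, "Version"))  # → True
```

A record like this maps naturally onto knowledge graph triples (mention, relation, value), which is how such corpora typically serve as training data for Relation Extraction and Entity Linking.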
Related papers
- Don't mention it: An approach to assess challenges to using software
mentions for citation and discoverability research [0.3268055538225029]
We present an approach to assess the usability of software mention datasets for research on research software.
One dataset does not provide links to mentioned software at all; the other does so in a way that can impede quantitative research endeavors.
The greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation.
arXiv Detail & Related papers (2024-02-22T14:51:17Z)
- How do software citation formats evolve over time? A longitudinal analysis of R programming language packages [12.082972614614413]
This study compares and analyzes a longitudinal dataset of citation formats of all R packages collected in 2021 and 2022.
We investigate the different document types underlying the citations and what metadata elements in the citation formats changed over time.
arXiv Detail & Related papers (2023-07-17T09:18:57Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Deep Graph Learning for Anomalous Citation Detection [55.81334139806342]
We propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks.
Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of citation based on citation texts.
arXiv Detail & Related papers (2022-02-23T09:05:28Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means of encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Pattern-based Acquisition of Scientific Entities from Scholarly Article Titles [0.0]
We describe a rule-based approach for the automatic acquisition of scientific entities from scholarly article titles.
We identify a set of lexico-syntactic patterns that are easily recognizable.
A subset of the acquisition algorithm is implemented for article titles in the Computational Linguistics (CL) scholarly domain.
arXiv Detail & Related papers (2021-09-01T05:59:06Z)
- Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF [0.0]
This research focuses on the performance of word embeddings applied to a large scale academic corpus.
We compare quality and efficiency of trained word embeddings to TFIDF representations in modeling content of scientific articles.
Our results show that content models based on word embeddings are better for titles (short text), while TFIDF works better for abstracts (longer text).
arXiv Detail & Related papers (2021-07-11T23:58:39Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach [0.483420384410068]
We present SoftwareKG, a knowledge graph that contains information about software mentions from more than 51,000 scientific articles from the social sciences.
An LSTM-based neural network was trained to identify software mentions in scientific articles.
We show how SoftwareKG can be used to assess the role of software in the social sciences.
arXiv Detail & Related papers (2020-03-24T08:38:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.