Investigating Software Usage in the Social Sciences: A Knowledge Graph
Approach
- URL: http://arxiv.org/abs/2003.10715v2
- Date: Fri, 27 Aug 2021 06:38:05 GMT
- Title: Investigating Software Usage in the Social Sciences: A Knowledge Graph
Approach
- Authors: David Schindler, Benjamin Zapilko, Frank Kr\"uger
- Abstract summary: We present SoftwareKG, a knowledge graph that contains information about software mentions from more than 51,000 scientific articles from the social sciences.
A neural network was used to train an LSTM based neural network to identify software mentions in scientific articles.
We show how SoftwareKG can be used to assess the role of software in the social sciences.
- Score: 0.483420384410068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge about the software used in scientific investigations is necessary
for different reasons, including provenance of the results, measuring software
impact to attribute developers, and bibliometric software citation analysis in
general. Additionally, providing information about whether and how the software
and the source code are available allows an assessment about the state and role
of open source software in science in general. While such analyses can be done
manually, large scale analyses require the application of automated methods of
information extraction and linking. In this paper, we present SoftwareKG - a
knowledge graph that contains information about software mentions from more
than 51,000 scientific articles from the social sciences. A silver standard
corpus, created by a distant and weak supervision approach, and a gold standard
corpus, created by manual annotation, were used to train an LSTM based neural
network to identify software mentions in scientific articles. The model
achieves a recognition rate of .82 F-score in exact matches. As a result, we
identified more than 133,000 software mentions. For entity disambiguation, we
used the public domain knowledge base DBpedia. Furthermore, we linked the
entities of the knowledge graph to other knowledge bases such as the Microsoft
Academic Knowledge Graph, the Software Ontology, and Wikidata. Finally, we
illustrate, how SoftwareKG can be used to assess the role of software in the
social sciences.
Related papers
- A Systematic Literature Review on the Use of Machine Learning in Software Engineering [0.0]
The study was carried out following the objective and the research questions to explore the current state of the art in applying machine learning techniques in software engineering processes.
The review identifies the key areas within software engineering where ML has been applied, including software quality assurance, software maintenance, software comprehension, and software documentation.
arXiv Detail & Related papers (2024-06-19T23:04:27Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - SciCat: A Curated Dataset of Scientific Software Repositories [4.77982299447395]
We introduce the SciCat dataset -- a comprehensive collection of Free-Libre Open Source Software (FLOSS) projects.
Our approach involves selecting projects from a pool of 131 million deforked repositories from the World of Code data source.
Our classification focuses on software designed for scientific purposes, research-related projects, and research support software.
arXiv Detail & Related papers (2023-12-11T13:46:33Z) - Using Machine Learning To Identify Software Weaknesses From Software
Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z) - Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z) - Evaluation of software impact designed for biomedical research: Are we
measuring what's meaningful? [17.645303073710732]
Analysis of usage and impact metrics can help developers determine user and community engagement.
There are challenges associated with these analyses including distorted or misleading metrics.
Some tools may be especially beneficial to a small audience, yet may not have compelling typical usage metrics.
arXiv Detail & Related papers (2023-06-05T21:15:05Z) - The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - SoMeSci- A 5 Star Open Data Gold Standard Knowledge Graph of Software
Mentions in Scientific Articles [1.335443972283229]
SoMeSci is a knowledge graph of software mentions in scientific articles.
It contains high quality annotations (IRR: $kappa=.82$) of 3756 software mentions in 1367 PubMed Central articles.
arXiv Detail & Related papers (2021-08-20T08:53:03Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers
for Analyzing Data Analysis [33.190021245507445]
Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process.
We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments.
We show that our model, leveraging only easily-available weak supervision, achieves a 38% increase in accuracy over expert-supplieds and outperforms a suite of baselines.
arXiv Detail & Related papers (2020-08-28T19:57:49Z) - Benchmarking Graph Neural Networks [75.42159546060509]
Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs.
For any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress.
GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework.
arXiv Detail & Related papers (2020-03-02T15:58:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.