SciCat: A Curated Dataset of Scientific Software Repositories
- URL: http://arxiv.org/abs/2312.06382v1
- Date: Mon, 11 Dec 2023 13:46:33 GMT
- Title: SciCat: A Curated Dataset of Scientific Software Repositories
- Authors: Addi Malviya-Thakur, Reed Milewicz, Lavinia Paganini, Ahmed Samir Imam Mahmoud, Audris Mockus
- Abstract summary: We introduce the SciCat dataset -- a comprehensive collection of Free-Libre Open Source Software (FLOSS) projects.
Our approach involves selecting projects from a pool of 131 million deforked repositories from the World of Code data source.
Our classification focuses on software designed for scientific purposes, research-related projects, and research support software.
- Score: 4.77982299447395
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The proliferation of open-source software for science and research
presents opportunities and challenges. In this paper, we introduce the SciCat
dataset -- a comprehensive collection of Free-Libre Open Source Software
(FLOSS) projects, designed to address the need for a curated repository of
scientific and research software. This collection is crucial for understanding
the creation of scientific software and aiding in its development. To ensure
extensive coverage, our approach involves selecting projects from a pool of 131
million deforked repositories from the World of Code data source. Subsequently,
we analyze README.md files using OpenAI's advanced language models. Our
classification focuses on software designed for scientific purposes,
research-related projects, and research support software. The SciCat dataset
aims to become an invaluable tool for researching science-related software,
shedding light on emerging trends, prevalent practices, and challenges in the
field of scientific software development. Furthermore, it includes data that
can be linked to the World of Code, GitHub, and other platforms, providing a
solid foundation for conducting comparative studies between scientific and
non-scientific software.
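
The README-based classification step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the category labels are paraphrased from the abstract, the prompt wording and the truncation limit are assumptions, and `ask_model` is a hypothetical callable standing in for whatever language-model client is used. A keyword-based stub replaces the model so the example runs offline.

```python
from typing import Callable

# Labels paraphrased from the abstract; the exact label set used for SciCat
# is an assumption, and "other" is added here as a fallback.
CATEGORIES = (
    "scientific software",
    "research-related project",
    "research support software",
    "other",
)


def build_prompt(readme_text: str, max_chars: int = 4000) -> str:
    """Build a classification prompt from README text.

    The truncation limit and the wording are hypothetical choices.
    """
    excerpt = readme_text[:max_chars]
    options = "\n".join(f"- {name}" for name in CATEGORIES)
    return (
        "Classify the repository described by this README into exactly one category.\n"
        f"Categories:\n{options}\n\n"
        f"README:\n{excerpt}\n\n"
        "Answer with the category name only."
    )


def parse_label(raw_answer: str) -> str:
    """Map a free-text answer onto a known category, falling back to 'other'."""
    answer = raw_answer.strip().lower()
    for category in CATEGORIES:
        if category in answer:
            return category
    return "other"


def classify_readme(readme_text: str, ask_model: Callable[[str], str]) -> str:
    """Classify one README using any callable that maps a prompt to an answer."""
    return parse_label(ask_model(build_prompt(readme_text)))


def keyword_stub(prompt: str) -> str:
    """Offline stand-in for a language model; NOT a real classifier."""
    readme = prompt.split("README:\n", 1)[1].lower()
    if "simulation" in readme or "solver" in readme:
        return "Scientific software"
    if "paper" in readme or "thesis" in readme:
        return "research-related project"
    return "other"


if __name__ == "__main__":
    label = classify_readme("A finite-element solver for fluid simulation.", keyword_stub)
    print(label)
```

Passing the model in as a callable keeps the prompt construction and label parsing testable without network access; a real client would simply replace `keyword_stub`.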
Related papers
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- Framework and Methodology for Verification of a Complex Scientific Simulation Software, Flash-X [0.8437187555622163]
Computational science relies on scientific software as its primary instrument for scientific discovery.
Scientific software verification can be especially difficult, as users typically need to modify the software as part of a scientific study.
Here, we describe a methodology that we have developed for Flash-X, a community simulation software for multiple scientific domains.
arXiv Detail & Related papers (2023-08-30T17:57:37Z)
- CLAIMED -- the open source framework for building coarse-grained operators for accelerated discovery in science [0.0]
CLAIMED is a framework for building reusable operators and scalable scientific workflows, letting scientists draw on previous work by re-composing scientific operators.
CLAIMED is agnostic to programming language, scientific library, and execution environment.
arXiv Detail & Related papers (2023-07-12T11:54:39Z)
- A Metadata-Based Ecosystem to Improve the FAIRness of Research Software [0.3185506103768896]
The reuse of research software is central to research efficiency and academic exchange.
The DataDesc ecosystem is presented, an approach to describing data models of software interfaces with detailed and machine-actionable metadata.
arXiv Detail & Related papers (2023-06-18T19:01:08Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- An overview of open source Deep Learning-based libraries for Neuroscience [0.0]
This paper summarizes the main developments in Deep Learning and their relevance to Neuroscience.
It then reviews neuroinformatic toolboxes and libraries, collected from the literature and from specific hubs of software projects oriented to neuroscience research.
arXiv Detail & Related papers (2022-12-19T09:09:40Z)
- Caching and Reproducibility: Making Data Science experiments faster and FAIRer [25.91002326340444]
Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams.
We suggest making caching an integral part of the research software development process, even before the first line of code is written.
arXiv Detail & Related papers (2022-11-08T07:11:02Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)