Scientific Open-Source Software Is Less Likely to Become Abandoned Than One Might Think! Lessons from Curating a Catalog of Maintained Scientific Software
- URL: http://arxiv.org/abs/2504.18971v1
- Date: Sat, 26 Apr 2025 16:49:49 GMT
- Title: Scientific Open-Source Software Is Less Likely to Become Abandoned Than One Might Think! Lessons from Curating a Catalog of Maintained Scientific Software
- Authors: Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, Audris Mockus
- Abstract summary: We use large language models to classify public software repositories in World of Code. We estimate survival models to understand how the domain, infrastructural layer, and other attributes affect its longevity. We find that infrastructural layers, downstream dependencies, mentions of publications, and participants from government are associated with a longer lifespan.
- Score: 11.900608344217844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific software is essential to scientific innovation, and in many ways it is distinct from other types of software. Abandoned (or unmaintained), buggy, and hard-to-use software, a perception often associated with scientific software, can hinder scientific progress; yet, in contrast to other types of software, its longevity is poorly understood. Existing data curation efforts are fragmented by science domain and/or are small in scale and lack key attributes. We use large language models to classify public software repositories in World of Code into distinct scientific domains and layers of the software stack, curating a large and diverse collection of over 18,000 scientific software projects. Using this data, we estimate survival models to understand how the domain, infrastructural layer, and other attributes of scientific software affect its longevity. We further obtain a matched sample of non-scientific software repositories and investigate the differences. We find that infrastructural layers, downstream dependencies, mentions of publications, and participants from government are associated with a longer lifespan, while newer projects with participants from academia have a shorter lifespan. Against common expectations, scientific projects have a longer lifetime than matched non-scientific open-source software projects. We expect our curated attribute-rich collection to support future research on scientific software and provide insights that may help extend the longevity of both scientific and other projects.
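The abstract does not say which survival models the authors fit, so the following is only a minimal sketch of the general idea, using the standard Kaplan-Meier estimator as a stand-in. The function name `kaplan_meier` and the toy lifetimes are hypothetical and are not taken from the paper; a project that is still active at the end of observation is treated as censored.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for every time t at which at least one event occurred.

    durations: observed lifetime of each project (any consistent time unit)
    observed:  True if the project was abandoned at that time (event),
               False if it was still active when observation ended (censored)
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)      # events and censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: lifetimes in years; False marks a still-active project.
durations = [1, 2, 2, 3, 4, 5]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Comparing such curves between groups (for example, scientific versus matched non-scientific projects) is one common way to examine differences in longevity; regression-style survival models additionally adjust for covariates such as domain or stack layer.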
Related papers
- Scaling Laws in Scientific Discovery with AI and Robot Scientists [72.3420699173245]
An autonomous generalist scientist (AGS) concept combines agentic AI and embodied robotics to automate the entire research lifecycle.
AGS aims to significantly reduce the time and resources needed for scientific discovery.
As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws.
arXiv Detail & Related papers (2025-03-28T14:00:27Z)
- DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery [61.02102713094486]
Good interpretation is important in scientific reasoning, as it allows for better decision-making. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE, an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data.
arXiv Detail & Related papers (2025-02-14T10:26:14Z)
- Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research. VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- Cycling on the Freeway: The Perilous State of Open Source Neuroscience Software [46.83624918571962]
We will argue that the existing ecosystem of neuroscientific open source software is brittle.
In recent years there has been a shift toward relying on free, open-source scientific software (FOSSS) for neuroscience data analysis.
arXiv Detail & Related papers (2024-03-28T13:11:09Z)
- SciCat: A Curated Dataset of Scientific Software Repositories [4.77982299447395]
We introduce the SciCat dataset -- a comprehensive collection of Free-Libre Open Source Software (FLOSS) projects.
Our approach involves selecting projects from a pool of 131 million deforked repositories from the World of Code data source.
Our classification focuses on software designed for scientific purposes, research-related projects, and research support software.
arXiv Detail & Related papers (2023-12-11T13:46:33Z)
- Framework and Methodology for Verification of a Complex Scientific Simulation Software, Flash-X [0.8437187555622163]
Computational science relies on scientific software as its primary instrument for scientific discovery.
Scientific software verification can be especially difficult, as users typically need to modify the software as part of a scientific study.
Here, we describe a methodology that we have developed for Flash-X, a community simulation software for multiple scientific domains.
arXiv Detail & Related papers (2023-08-30T17:57:37Z)
- CLAIMED -- the open source framework for building coarse-grained operators for accelerated discovery in science [0.0]
CLAIMED is a framework to build reusable operators and scalable scientific agnostic by supporting the scientist to draw from previous work by re-composing scientific operators.
CLAIMED is programming language, scientific library, and execution environment.
arXiv Detail & Related papers (2023-07-12T11:54:39Z)
- The Semantic Scholar Open Data Platform [92.2948743167744]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Caching and Reproducibility: Making Data Science experiments faster and FAIRer [25.91002326340444]
Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams.
We suggest making caching an integral part of the research software development process, even before the first line of code is written.
arXiv Detail & Related papers (2022-11-08T07:11:02Z)
- End-of-Life of Software How is it Defined and Managed? [1.370633147306388]
It is becoming quicker and cheaper to abandon old software and acquire new software that meets rapidly changing needs and demands.
This paper will explore the systems engineering concept of end-of-life for software.
It will bring forward examples of software that has been abandoned in an attempt to decommission and it will explore the repercussions of abandoned software artefacts.
arXiv Detail & Related papers (2022-04-08T01:15:02Z)