Caching and Reproducibility: Making Data Science experiments faster and
FAIRer
- URL: http://arxiv.org/abs/2211.04049v2
- Date: Wed, 9 Nov 2022 14:45:50 GMT
- Title: Caching and Reproducibility: Making Data Science experiments faster and
FAIRer
- Authors: Moritz Schubotz, Ankit Satpute, Andre Greiner-Petter, Akiko Aizawa,
Bela Gipp
- Abstract summary: Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams.
We suggest making caching an integral part of the research software development process, even before the first line of code is written.
- Score: 25.91002326340444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Small to medium-scale data science experiments often rely on research
software developed ad-hoc by individual scientists or small teams. Often there
is no time to make the research software fast, reusable, and open access. The
consequence is twofold. First, subsequent researchers must spend significant
work hours building upon the proposed hypotheses or experimental framework. In
the worst case, others cannot reproduce the experiment and reuse the findings
for subsequent research. Second, suppose the ad-hoc research software fails
during often long-running computationally expensive experiments. In that case,
the overall effort to iteratively improve the software and rerun the
experiments creates significant time pressure on the researchers. We suggest
making caching an integral part of the research software development process,
even before the first line of code is written. This article outlines caching
recommendations for developing research software in data science projects. Our
recommendations provide a perspective to circumvent common problems such as
proprietary dependence, speed, etc. At the same time, caching contributes to the
reproducibility of experiments in the open science workflow. Concerning the
four guiding principles, i.e., Findability, Accessibility, Interoperability,
and Reusability (FAIR), we foresee that including the proposed recommendations
in research software development will make the data related to that software
FAIRer for both machines and humans. We exhibit the usefulness of some of the
proposed recommendations on our recently completed research software project in
mathematical information retrieval.
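The article proposes process-level caching recommendations rather than a specific library. As a minimal sketch of the core idea, assuming a Python data science stack, the hypothetical `disk_cache` decorator below memoizes an expensive experiment step to disk, so a crash late in a long-running pipeline does not force recomputing already-finished steps. All names here (`disk_cache`, `expensive_step`, the `.cache` directory) are illustrative, not from the paper.
```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".cache")  # hypothetical cache location; adjust per project

def disk_cache(func):
    """Cache a function's result on disk, keyed by its name and arguments.

    Assumes arguments are picklable and deterministic; a cache hit skips
    the computation entirely on reruns of the experiment.
    """
    def wrapper(*args, **kwargs):
        CACHE_DIR.mkdir(exist_ok=True)
        # Derive a stable key from the function name and its arguments.
        key = hashlib.sha256(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = func(*args, **kwargs)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper

@disk_cache
def expensive_step(dataset: str, alpha: float) -> float:
    # Placeholder for a long-running computation over `dataset`.
    return sum(i * alpha for i in range(10_000_000)) % 97

print(expensive_step("corpus-v1", 0.5))  # computed once, then read from disk
```
In practice one would also fold a code or data version identifier into the cache key, so that stale results are never silently reused; persisted, content-addressed intermediate results of this kind are also what makes cached experiment data easier to find and reuse.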
Related papers
- MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
arXiv Detail & Related papers (2024-02-21T14:22:20Z)
- A pragmatic workflow for research software engineering in computational science [0.0]
University research groups in Computational Science and Engineering (CSE) generally lack dedicated funding and personnel for Research Software Engineering (RSE).
This lack shifts the focus away from sustainable research software development and reproducible results.
We propose an RSE workflow for CSE that addresses these challenges and improves the quality of research output in CSE.
arXiv Detail & Related papers (2023-10-02T08:04:12Z)
- Managing Software Provenance to Enhance Reproducibility in Computational Research [1.1421942894219899]
Management of computation-based scientific studies is often left to individual researchers who design their experiments based on personal preferences and the nature of the study.
We believe that the quality, efficiency, and reproducibility of computation-based scientific research can be improved by explicitly creating an execution environment that allows researchers to provide a clear record of traceability.
arXiv Detail & Related papers (2023-08-29T21:13:18Z)
- Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z)
- CLAIMED -- the open source framework for building coarse-grained operators for accelerated discovery in science [0.0]
CLAIMED is a framework for building reusable operators and scalable scientific workflows, supporting scientists in drawing from previous work by re-composing scientific operators.
CLAIMED is programming language, scientific library, and execution environment agnostic.
arXiv Detail & Related papers (2023-07-12T11:54:39Z)
- A Metadata-Based Ecosystem to Improve the FAIRness of Research Software [0.3185506103768896]
The reuse of research software is central to research efficiency and academic exchange.
The DataDesc ecosystem is presented, an approach to describing data models of software interfaces with detailed and machine-actionable metadata.
arXiv Detail & Related papers (2023-06-18T19:01:08Z)
- GFlowNets for AI-Driven Scientific Discovery [74.27219800878304]
We present a new probabilistic machine learning framework called GFlowNets.
GFlowNets can be applied in the modeling, hypothesis generation, and experimental design stages of the experimental science loop.
We argue that GFlowNets can become a valuable tool for AI-driven scientific discovery.
arXiv Detail & Related papers (2023-02-01T17:29:43Z)
- PyExperimenter: Easily distribute experiments and track results [63.871474825689134]
PyExperimenter is a tool to facilitate the setup, documentation, execution, and subsequent evaluation of results from an empirical study of algorithms.
It is intended for researchers in the field of artificial intelligence, but is not limited to them.
arXiv Detail & Related papers (2023-01-16T10:43:02Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- Benchopt: Reproducible, efficient and collaborative optimization benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z)
- A user-centered approach to designing an experimental laboratory data platform [0.0]
We take a user-centered approach to understand what essential elements of design and functionality researchers want in an experimental data platform.
We find that having the capability to contextualize rich, complex experimental datasets is the primary user requirement.
arXiv Detail & Related papers (2020-07-28T19:26:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its contents (including all information) and is not responsible for any consequences arising from their use.