Caching and Reproducibility: Making Data Science experiments faster and
FAIRer
- URL: http://arxiv.org/abs/2211.04049v2
- Date: Wed, 9 Nov 2022 14:45:50 GMT
- Title: Caching and Reproducibility: Making Data Science experiments faster and
FAIRer
- Authors: Moritz Schubotz, Ankit Satpute, Andre Greiner-Petter, Akiko Aizawa,
Bela Gipp
- Abstract summary: Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams.
We suggest making caching an integral part of the research software development process, even before the first line of code is written.
- Score: 25.91002326340444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Small to medium-scale data science experiments often rely on research
software developed ad-hoc by individual scientists or small teams. Often there
is no time to make the research software fast, reusable, and open access. The
consequence is twofold. First, subsequent researchers must spend significant
work hours building upon the proposed hypotheses or experimental framework. In
the worst case, others cannot reproduce the experiment and reuse the findings
for subsequent research. Second, suppose the ad-hoc research software fails
during often long-running computationally expensive experiments. In that case,
the overall effort to iteratively improve the software and rerun the
experiments creates significant time pressure on the researchers. We suggest
making caching an integral part of the research software development process,
even before the first line of code is written. This article outlines caching
recommendations for developing research software in data science projects. Our
recommendations provide a perspective to circumvent common problems such as
proprietary dependence, speed, etc. At the same time, caching contributes to the
reproducibility of experiments in the open science workflow. Concerning the
four guiding principles, i.e., Findability, Accessibility, Interoperability,
and Reusability (FAIR), we foresee that including the proposed recommendations
in research software development will make the data related to that software
FAIRer for both machines and humans. We exhibit the usefulness of some of the
proposed recommendations on our recently completed research software project in
mathematical information retrieval.
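The article proposes process-level caching recommendations rather than a specific library. As a minimal sketch of the core idea, assuming a Python data science stack, the hypothetical `disk_cache` decorator below memoizes an expensive experiment step to disk, so a crash late in a long-running pipeline does not force recomputing already-finished steps. All names here (`disk_cache`, `expensive_step`, the `.cache` directory) are illustrative, not from the paper.
```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".cache")  # hypothetical cache location; adjust per project

def disk_cache(func):
    """Cache a function's result on disk, keyed by its name and arguments.

    Assumes arguments are picklable and deterministic; a cache hit skips
    the computation entirely on reruns of the experiment.
    """
    def wrapper(*args, **kwargs):
        CACHE_DIR.mkdir(exist_ok=True)
        # Derive a stable key from the function name and its arguments.
        key = hashlib.sha256(
            pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        ).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = func(*args, **kwargs)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper

@disk_cache
def expensive_step(dataset: str, alpha: float) -> float:
    # Placeholder for a long-running computation over `dataset`.
    return sum(i * alpha for i in range(10_000_000)) % 97

print(expensive_step("corpus-v1", 0.5))  # computed once, then read from disk
```
In practice one would also fold a code or data version identifier into the cache key, so that stale results are never silently reused; persisted, content-addressed intermediate results of this kind are also what makes cached experiment data easier to find and reuse.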
Related papers
- MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
arXiv Detail & Related papers (2024-02-21T14:22:20Z)
- A pragmatic workflow for research software engineering in computational science [0.0]
University research groups in Computational Science and Engineering (CSE) generally lack dedicated funding and personnel for Research Software Engineering (RSE).
This lack shifts the focus away from sustainable research software development and reproducible results.
We propose an RSE workflow for CSE that addresses these challenges and improves the quality of research output in CSE.
arXiv Detail & Related papers (2023-10-02T08:04:12Z)
- Managing Software Provenance to Enhance Reproducibility in Computational Research [1.1421942894219899]
Management of computation-based scientific studies is often left to individual researchers who design their experiments based on personal preferences and the nature of the study.
We believe that the quality, efficiency, and reproducibility of computation-based scientific research can be improved by explicitly creating an execution environment that allows researchers to provide a clear record of traceability.
arXiv Detail & Related papers (2023-08-29T21:13:18Z)
- Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z)
- CLAIMED -- the open source framework for building coarse-grained operators for accelerated discovery in science [0.0]
CLAIMED is a framework for building reusable operators and scalable scientific workflows, supporting scientists in drawing from previous work by re-composing scientific operators.
CLAIMED is programming language, scientific library, and execution environment agnostic.
arXiv Detail & Related papers (2023-07-12T11:54:39Z)
- A Metadata-Based Ecosystem to Improve the FAIRness of Research Software [0.3185506103768896]
The reuse of research software is central to research efficiency and academic exchange.
The DataDesc ecosystem is presented, an approach to describing data models of software interfaces with detailed and machine-actionable metadata.
arXiv Detail & Related papers (2023-06-18T19:01:08Z)
- GFlowNets for AI-Driven Scientific Discovery [74.27219800878304]
We present a new probabilistic machine learning framework called GFlowNets.
GFlowNets can be applied in the modeling, hypothesis generation, and experimental design stages of the experimental science loop.
We argue that GFlowNets can become a valuable tool for AI-driven scientific discovery.
arXiv Detail & Related papers (2023-02-01T17:29:43Z)
- PyExperimenter: Easily distribute experiments and track results [63.871474825689134]
PyExperimenter is a tool to facilitate the setup, documentation, execution, and subsequent evaluation of results from an empirical study of algorithms.
It is intended for researchers in the field of artificial intelligence, but is not limited to them.
arXiv Detail & Related papers (2023-01-16T10:43:02Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- Benchopt: Reproducible, efficient and collaborative optimization benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z)
- A user-centered approach to designing an experimental laboratory data platform [0.0]
We take a user-centered approach to understand what essential elements of design and functionality researchers want in an experimental data platform.
We find that having the capability to contextualize rich, complex experimental datasets is the primary user requirement.
arXiv Detail & Related papers (2020-07-28T19:26:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its contents (including all information) and is not responsible for any consequences arising from their use.