The challenge of reproducible ML: an empirical study on the impact of bugs
- URL: http://arxiv.org/abs/2109.03991v1
- Date: Thu, 9 Sep 2021 01:36:39 GMT
- Title: The challenge of reproducible ML: an empirical study on the impact of bugs
- Authors: Emilio Rivera-Landos, Foutse Khomh, Amin Nikanjam
- Abstract summary: In this paper, we establish the fundamental factors that cause non-determinism in Machine Learning systems.
A framework, ReproduceML, is then introduced for deterministic evaluation of ML experiments in a real, controlled environment.
This study attempts to quantify the impact that the occurrence of bugs in a popular ML framework, PyTorch, has on the performance of trained models.
- Score: 6.862925771672299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reproducibility is a crucial requirement in scientific research. When the results of studies and scientific papers prove difficult or impossible to reproduce, we face what is known as the reproducibility crisis. Although the demand for reproducibility in Machine Learning (ML) is acknowledged in the literature, a major barrier is the inherent non-determinism of ML training and inference. In this paper, we establish the fundamental factors that cause non-determinism in ML systems. We then introduce ReproduceML, a framework for the deterministic evaluation of ML experiments in a real, controlled environment, which allows researchers to investigate the effects of software configuration on ML training and inference. Using ReproduceML, we run a case study investigating the impact of bugs inside ML libraries on the performance of ML experiments. Specifically, the study attempts to quantify the effect that bugs in a popular ML framework, PyTorch, have on the performance of trained models. To do so, we propose a comprehensive methodology for collecting buggy versions of ML libraries and running deterministic ML experiments with ReproduceML. Our initial finding is that, based on our limited dataset, there is no evidence that the bugs that occurred in PyTorch affect the performance of trained models. Both the proposed methodology and ReproduceML can be employed for further research on non-determinism and bugs.
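The abstract does not spell out how ReproduceML enforces determinism, but the usual PyTorch-side controls give a concrete sense of what "deterministic evaluation" of an ML experiment requires. The sketch below is a generic illustration, not ReproduceML's actual code; the function name and seed value are our own.

```python
# Illustrative determinism controls for a PyTorch run; not ReproduceML's code.
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42) -> None:
    """Pin down the common sources of non-determinism in PyTorch training."""
    random.seed(seed)                      # Python's RNG
    np.random.seed(seed)                   # NumPy's RNG
    torch.manual_seed(seed)                # CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # no autotuned kernel selection
    torch.use_deterministic_algorithms(True)    # error on non-deterministic ops
    # Required for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

make_deterministic(seed=42)
```

Even with all of these set, data-loader worker seeding and hardware differences can still break bitwise reproducibility, which is presumably why a controlled environment is part of the framework.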
Related papers
- Recent Advances on Machine Learning for Computational Fluid Dynamics: A Survey [51.87875066383221]
This paper introduces fundamental concepts, traditional methods, and benchmark datasets, then examines the various roles Machine Learning plays in improving CFD.
We highlight real-world applications of ML for CFD in critical scientific and engineering disciplines, including aerodynamics, combustion, atmosphere & ocean science, biological fluids, plasma, symbolic regression, and reduced-order modeling.
We draw the conclusion that ML is poised to significantly transform CFD research by enhancing simulation accuracy, reducing computational time, and enabling more complex analyses of fluid dynamics.
arXiv Detail & Related papers (2024-08-22T07:33:11Z) - Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers [1.4841630983274845]
Research in various fields is currently experiencing challenges regarding the reproducibility of results.
This problem is also prevalent in machine learning (ML) research.
The level of reproducibility in ML-driven research remains unsatisfactory.
arXiv Detail & Related papers (2024-06-20T13:56:42Z) - MLXP: A Framework for Conducting Replicable Experiments in Python [63.37350735954699]
We propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python.
It streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
arXiv Detail & Related papers (2024-02-21T14:22:20Z) - Exploring Perceptual Limitation of Multimodal Large Language Models [57.567868157293994]
We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs.
We identify four independent factors that can contribute to this perceptual limitation.
Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
arXiv Detail & Related papers (2024-02-12T03:04:42Z) - Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs).
We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect).
We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate the effect, reducing performance deterioration by 60%, but cannot entirely resolve sycophantic behavior.
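The protocol itself is simple enough to sketch. Below is a minimal, hypothetical rendering of a FlipFlop-style trial: `query_llm` stands in for whatever chat-completion client is used, and the challenge phrase and flip bookkeeping are our illustration, not the paper's code.

```python
# Minimal sketch of a FlipFlop-style trial; `query_llm` is a hypothetical
# stand-in for a chat-completion client, not an API from the paper.
from typing import Callable, List, Tuple

Message = dict  # {"role": ..., "content": ...}

def flipflop_trial(
    query_llm: Callable[[List[Message]], str],
    question: str,
    challenge: str = "Are you sure? Please reconsider your answer.",
) -> Tuple[str, str, bool]:
    """Ask a question, push back once, and report whether the answer flipped."""
    messages = [{"role": "user", "content": question}]
    first = query_llm(messages)                   # turn 1: initial answer
    messages.append({"role": "assistant", "content": first})
    messages.append({"role": "user", "content": challenge})
    final = query_llm(messages)                   # turn 2: answer under pushback
    # A real evaluation would extract and compare task answers, not raw text.
    return first, final, first.strip() != final.strip()

def flip_rate(results: List[Tuple[str, str, bool]]) -> float:
    """Fraction of trials in which the model changed its answer."""
    return sum(flipped for _, _, flipped in results) / len(results)
```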
arXiv Detail & Related papers (2023-11-14T23:40:22Z) - Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models [0.23301643766310373]
We present the rationale behind julearn's design and its core features, and showcase three examples from previously published research projects.
Julearn aims to simplify entry into the machine learning world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls.
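As a flavor of those guards, the sketch below uses julearn's `run_cross_validation` entry point, which fits preprocessing and model inside each cross-validation fold; we are recalling the documented interface here, so treat the exact parameter names and return columns as assumptions that may differ across julearn versions.

```python
# Hedged sketch of julearn's run_cross_validation; exact parameter names and
# return format are assumptions and may vary across julearn versions.
import pandas as pd
from julearn import run_cross_validation

df = pd.DataFrame({
    "feat_a": [0.1, 0.4, 0.35, 0.8] * 25,
    "feat_b": [1.0, 0.2, 0.5, 0.9] * 25,
    "target": [0, 1, 0, 1] * 25,
})

# Fitting happens inside each CV fold, julearn's guard against leakage.
scores = run_cross_validation(
    X=["feat_a", "feat_b"],        # feature columns
    y="target",                    # target column
    data=df,
    model="rf",                    # shorthand for a random forest
    problem_type="classification",
)
print(scores["test_score"].mean())
```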
arXiv Detail & Related papers (2023-10-19T08:21:12Z) - Reproducibility in Machine Learning-Driven Research [1.7936835766396748]
Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce.
This is also the case in machine learning (ML) and artificial intelligence (AI) research.
Although different solutions to address this issue are discussed in the research community, such as using ML platforms, the level of reproducibility in ML-driven research is not increasing substantially.
arXiv Detail & Related papers (2023-07-19T07:00:22Z) - Leakage and the Reproducibility Crisis in ML-based Science [5.116305213887073]
We show that data leakage is indeed a widespread problem and has led to severe failures.
We present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems.
We propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey.
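The most common textbook case in that taxonomy is preprocessing fit on the full dataset before splitting, which leaks held-out statistics into training. A purely illustrative scikit-learn comparison (not code from the paper):

```python
# Textbook leakage: the scaler must be fit inside each training fold,
# not on the full dataset. Illustrative only; not code from the paper.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# LEAKY: the scaler sees every row, including each held-out fold.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# LEAK-FREE: the scaler is re-fit on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy:     {leaky.mean():.3f}")
print(f"leak-free CV accuracy: {clean.mean():.3f}")
```

On toy data the numeric gap is often small, but the leaky protocol is invalid whenever preprocessing statistics depend on held-out rows.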
arXiv Detail & Related papers (2022-07-14T16:44:59Z) - Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z) - Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles [0.0]
We describe our goals and initial steps in supporting the end-to-end reproducibility of machine learning pipelines.
We investigate which factors beyond the availability of source code and datasets influence the reproducibility of ML experiments.
We propose ways to apply FAIR data practices to ML experiments.
arXiv Detail & Related papers (2020-06-22T10:17:34Z) - Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond [69.83813153444115]
We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference.
Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances.
We propose localized debiased machine learning (LDML), which avoids the burdensome step of estimating nuisances at every candidate parameter value, requiring them only at a single initial estimate.
arXiv Detail & Related papers (2019-12-30T14:42:52Z)
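For orientation, the estimand at stake can be written in standard potential-outcomes notation; this is our rendering of the textbook definition, not a display copied from the paper:

```latex
% q-th quantile treatment effect (standard definition; our notation):
\theta_q = F_{Y(1)}^{-1}(q) - F_{Y(0)}^{-1}(q),
% where the q-quantile \theta of each potential outcome Y(a) solves
\mathbb{E}\bigl[\mathbf{1}\{Y(a) \le \theta\} - q\bigr] = 0, \qquad a \in \{0, 1\}.
```

DML estimates such moment conditions with cross-fitted nuisance estimates; LDML's contribution, as summarized above, is that the nuisances need only be evaluated at a single initial estimate of \theta_q rather than at every candidate value.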