Leakage and the Reproducibility Crisis in ML-based Science
- URL: http://arxiv.org/abs/2207.07048v1
- Date: Thu, 14 Jul 2022 16:44:59 GMT
- Title: Leakage and the Reproducibility Crisis in ML-based Science
- Authors: Sayash Kapoor, Arvind Narayanan
- Abstract summary: We show that data leakage is indeed a widespread problem and has led to severe failures.
We present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems.
We propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey.
- Score: 5.116305213887073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of machine learning (ML) methods for prediction and forecasting has
become widespread across the quantitative sciences. However, there are many
known methodological pitfalls, including data leakage, in ML-based science. In
this paper, we systematically investigate reproducibility issues in ML-based
science. We show that data leakage is indeed a widespread problem and has led
to severe reproducibility failures. Specifically, through a survey of
literature in research communities that adopted ML methods, we find 17 fields
where errors have been found, collectively affecting 329 papers and in some
cases leading to wildly overoptimistic conclusions. Based on our survey, we
present a fine-grained taxonomy of 8 types of leakage that range from textbook
errors to open research problems.
We argue for fundamental methodological changes to ML-based science so that
cases of leakage can be caught before publication. To that end, we propose
model info sheets for reporting scientific claims based on ML models that would
address all types of leakage identified in our survey. To investigate the
impact of reproducibility errors and the efficacy of model info sheets, we
undertake a reproducibility study in a field where complex ML models are
believed to vastly outperform older statistical models such as Logistic
Regression (LR): civil war prediction. We find that all papers claiming the
superior performance of complex ML models compared to LR models fail to
reproduce due to data leakage, and complex ML models don't perform
substantively better than decades-old LR models. While none of these errors
could have been caught by reading the papers, model info sheets would enable
the detection of leakage in each case.
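For intuition, the most common "textbook" entry in the taxonomy is preprocessing leakage: transformations such as imputation or scaling are fit on the full dataset before the train/test split, so test-set statistics contaminate the training features. The sketch below is a minimal illustration, not code from the paper; the synthetic dataset, logistic regression model, and scikit-learn API choices are assumptions made for the example.

```python
# Minimal sketch (not from the paper): one "textbook" leakage error from the
# taxonomy -- fitting preprocessing on the combined train/test data -- versus
# the leakage-free, split-first pipeline. All names here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # simulate missing values

# LEAKY: the imputer and scaler are fit on the full dataset before splitting,
# so statistics computed from the (future) test rows leak into training.
X_leaky = StandardScaler().fit_transform(SimpleImputer().fit_transform(X))
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky accuracy:", leaky_model.score(X_te, y_te))

# LEAKAGE-FREE: split first, then fit all preprocessing inside a pipeline
# using the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean_model = make_pipeline(SimpleImputer(), StandardScaler(),
                            LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("clean accuracy:", clean_model.score(X_te, y_te))
```

On i.i.d. toy data the two scores may barely differ, but the structural question, whether any preprocessing step ever saw test rows, is the kind of check the proposed model info sheets are meant to make explicit before publication.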
Related papers
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.
In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- Analysis of Zero Day Attack Detection Using MLP and XAI [0.0]
This paper analyzes Machine Learning (ML) and Deep Learning (DL) based approaches to creating Intrusion Detection Systems (IDS).
The focus is on the KDD99 dataset, the most widely studied dataset for detecting zero-day attacks.
We evaluate the performance of four multilayer perceptron (MLP) models trained on the KDD99 dataset: baseline, weighted, truncated, and weighted truncated ML models.
arXiv Detail & Related papers (2025-01-28T02:20:34Z)
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.
Current error classification methods rely on static and predefined categories.
We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics [51.17512229589]
PoLLMgraph is a model-based white-box detection and forecasting approach for large language models.
We show that hallucination can be effectively detected by analyzing the LLM's internal state transition dynamics.
Our work paves a new way for model-based white-box analysis of LLMs, motivating the research community to further explore, understand, and refine the intricate dynamics of LLM behaviors.
arXiv Detail & Related papers (2024-04-06T20:02:20Z)
- Machine Learning Data Suitability and Performance Testing Using Fault Injection Testing Framework [0.0]
This paper presents the Fault Injection for Undesirable Learning in input Data (FIUL-Data) testing framework.
Data mutators probe the vulnerability of ML systems to the effects of different fault injections.
This paper evaluates the framework using data from analytical chemistry, comprising retention time measurements of anti-sense oligonucleotides.
arXiv Detail & Related papers (2023-09-20T12:58:35Z)
- AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods under certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning [17.336655978572583]
Recent concerns that machine learning (ML) may be facing a reproducibility and replication crisis suggest that some published claims in ML research cannot be taken at face value.
A deeper understanding of what concerns in research in supervised ML have in common with the replication crisis in experimental science can put the new concerns in perspective.
arXiv Detail & Related papers (2022-03-12T18:26:24Z)
- The challenge of reproducible ML: an empirical study on the impact of bugs [6.862925771672299]
In this paper, we establish the fundamental factors that cause non-determinism in Machine Learning systems.
A framework, ReproduceML, is then introduced for deterministic evaluation of ML experiments in a real, controlled environment.
This study attempts to quantify the impact that the occurrence of bugs in a popular ML framework, PyTorch, has on the performance of trained models.
arXiv Detail & Related papers (2021-09-09T01:36:39Z)
- A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning [37.01683478234978]
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field.
One of the most important riddles is the good empirical generalization of overparameterized models.
arXiv Detail & Related papers (2021-09-06T10:48:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.