Leakage and the Reproducibility Crisis in ML-based Science
- URL: http://arxiv.org/abs/2207.07048v1
- Date: Thu, 14 Jul 2022 16:44:59 GMT
- Title: Leakage and the Reproducibility Crisis in ML-based Science
- Authors: Sayash Kapoor, Arvind Narayanan
- Abstract summary: We show that data leakage is indeed a widespread problem and has led to severe failures.
We present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems.
We propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey.
- Score: 5.116305213887073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of machine learning (ML) methods for prediction and forecasting has
become widespread across the quantitative sciences. However, there are many
known methodological pitfalls, including data leakage, in ML-based science. In
this paper, we systematically investigate reproducibility issues in ML-based
science. We show that data leakage is indeed a widespread problem and has led
to severe reproducibility failures. Specifically, through a survey of
literature in research communities that adopted ML methods, we find 17 fields
where errors have been found, collectively affecting 329 papers and in some
cases leading to wildly overoptimistic conclusions. Based on our survey, we
present a fine-grained taxonomy of 8 types of leakage that range from textbook
errors to open research problems.
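The simplest of these textbook errors is performing preprocessing or feature selection on the full dataset before splitting off a test set, so that information from the test set leaks into training. Below is a minimal sketch of that failure mode on synthetic data using scikit-learn; it is an illustration under assumptions of our own (pure-noise features, arbitrary sizes), not code or data from the paper.

```python
# Minimal illustration (synthetic data, not from the paper) of a "textbook"
# leakage error: selecting features on the full dataset before the
# train/test split, so the test labels influence the chosen features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # 2000 pure-noise features
y = rng.integers(0, 2, size=100)   # labels independent of X: true accuracy is ~50%

# Leaky pipeline: feature selection sees all labels, including the test fold's.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky test accuracy:", leaky.score(X_te, y_te))   # typically well above chance

# Leakage-free pipeline: split first, fit selection and model on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clean = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("clean test accuracy:", clean.score(X_te, y_te))    # close to chance, as it should be
```

Because the labels here are pure noise, any test accuracy meaningfully above 50% in the leaky pipeline is an artifact of the leak.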
We argue for fundamental methodological changes to ML-based science so that
cases of leakage can be caught before publication. To that end, we propose
model info sheets for reporting scientific claims based on ML models that would
address all types of leakage identified in our survey. To investigate the
impact of reproducibility errors and the efficacy of model info sheets, we
undertake a reproducibility study in a field where complex ML models are
believed to vastly outperform older statistical models such as Logistic
Regression (LR): civil war prediction. We find that all papers claiming the
superior performance of complex ML models compared to LR models fail to
reproduce due to data leakage, and complex ML models don't perform
substantively better than decades-old LR models. While none of these errors
could have been caught by reading the papers, model info sheets would enable
the detection of leakage in each case.
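As a rough illustration of the model info sheet idea, the checklist below records, alongside a reported claim, whether each kind of leakage condition has been ruled out. The field names are a simplification of ours, keyed to the kinds of leakage discussed in the abstract; they are not the paper's actual reporting template.

```python
# Hypothetical, simplified stand-in for a model info sheet. The real info
# sheets proposed in the paper are a structured set of questions; these field
# names are illustrative only and are not taken from the paper.
from dataclasses import dataclass

@dataclass
class ModelInfoSheet:
    claim: str                                  # the scientific claim the model supports
    dataset: str                                # data source and sampling procedure
    train_test_split: str                       # how the split was made (random, temporal, by unit)
    preprocessing_fit_on_train_only: bool       # no imputation/scaling/selection fit on the full data
    no_duplicates_across_splits: bool           # the same instance never appears in train and test
    no_illegitimate_features: bool              # no proxies for the outcome; features available at prediction time
    temporal_ordering_respected: bool           # no information from the future used for training
    train_test_dependence_handled: str          # how dependence between related instances was handled
    test_set_matches_target_distribution: bool  # evaluation data represents the population of interest

# Hypothetical example for a forecasting claim on a country-year panel.
sheet = ModelInfoSheet(
    claim="Model X predicts outcome Y better than logistic regression",
    dataset="Hypothetical country-year panel, 1990-2015",
    train_test_split="temporal: train on 1990-2009, test on 2010-2015",
    preprocessing_fit_on_train_only=True,
    no_duplicates_across_splits=True,
    no_illegitimate_features=True,
    temporal_ordering_respected=True,
    train_test_dependence_handled="serial dependence within countries acknowledged in the evaluation design",
    test_set_matches_target_distribution=True,
)
print(sheet)
```

The point, as in the paper's proposal, is that a reviewer can check for leakage from the sheet rather than having to rediscover it from the paper.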
Related papers
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection, as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- Missci: Reconstructing Fallacies in Misrepresented Science [84.32990746227385]
Health-related misinformation on social networks can lead to poor decision-making and real-world dangers.
Missci is a novel argumentation-theoretical model for fallacious reasoning.
We present Missci as a dataset to test the critical reasoning abilities of large language models.
arXiv Detail & Related papers (2024-06-05T12:11:10Z)
- Unraveling overoptimism and publication bias in ML-driven science [14.38643099447636]
Recent studies suggest that the published performance of machine learning models is often overoptimistic.
We introduce a novel model for observed accuracy, integrating parametric learning curves and the aforementioned biases.
Applying the model to meta-analyses of classifications of neurological conditions, we estimate the inherent limits of ML-based prediction in each domain.
arXiv Detail & Related papers (2024-05-23T10:43:20Z)
- PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics [51.17512229589]
PoLLMgraph is a model-based white-box detection and forecasting approach for large language models.
We show that hallucination can be effectively detected by analyzing the LLM's internal state transition dynamics.
Our work paves a new way for model-based white-box analysis of LLMs, motivating the research community to further explore, understand, and refine the intricate dynamics of LLM behaviors.
arXiv Detail & Related papers (2024-04-06T20:02:20Z)
- How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly being used in business applications and fundraising in AI.
LLMs' reported performance may no longer be reliable, as their high scores may be at least partly due to previous exposure to the evaluation data.
We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms (a generic sketch of this kind of overlap check follows this list).
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- Machine Learning Data Suitability and Performance Testing Using Fault Injection Testing Framework [0.0]
This paper presents the Fault Injection for Undesirable Learning in input Data (FIUL-Data) testing framework.
Data mutators explore vulnerabilities of ML systems against the effects of different fault injections.
This paper evaluates the framework using data from analytical chemistry, comprising retention time measurements of anti-sense oligonucleotides.
arXiv Detail & Related papers (2023-09-20T12:58:35Z)
- AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods under certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning [17.336655978572583]
Recent concerns that machine learning (ML) may be facing a misdiagnosis and replication crisis suggest that some published claims in ML research cannot be taken at face value.
A deeper understanding of what concerns in supervised ML research have in common with the replication crisis in experimental science can put the new concerns in perspective.
arXiv Detail & Related papers (2022-03-12T18:26:24Z)
- The challenge of reproducible ML: an empirical study on the impact of bugs [6.862925771672299]
In this paper, we establish the fundamental factors that cause non-determinism in Machine Learning systems.
A framework, ReproduceML, is then introduced for deterministic evaluation of ML experiments in a real, controlled environment.
This study attempts to quantify the impact that the occurrence of bugs in a popular ML framework, PyTorch, has on the performance of trained models.
arXiv Detail & Related papers (2021-09-09T01:36:39Z)
- A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning [37.01683478234978]
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field.
One of the most important riddles is the good empirical generalization of overparameterized models.
arXiv Detail & Related papers (2021-09-06T10:48:40Z)
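The contamination problem surveyed in the LLMSanitize entry above is a form of the same leakage issue: reported performance is inflated when evaluation data was already seen during training. As a generic, hypothetical sketch (not the LLMSanitize API), a minimal word n-gram overlap check between a training corpus and an evaluation set could look like this:

```python
# Hypothetical, generic contamination check: flag evaluation examples whose
# word n-grams also appear in the training corpus. This is NOT the LLMSanitize
# API, just a minimal illustration of string-overlap contamination detection.
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lower-cased word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(train_docs: Iterable[str], eval_docs: List[str],
                 n: int = 8, threshold: float = 0.2) -> List[int]:
    """Indices of eval docs sharing at least `threshold` of their n-grams with the training data."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = []
    for i, doc in enumerate(eval_docs):
        grams = ngrams(doc, n)
        if grams and len(grams & train_ngrams) / len(grams) >= threshold:
            flagged.append(i)
    return flagged

# Example: the second evaluation document repeats a training passage verbatim.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
evals = ["completely novel sentence with no overlap at all in any of its words here",
         "the quick brown fox jumps over the lazy dog near the river bank today"]
print(contaminated(train, evals))  # -> [1]
```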
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.