Don't Push the Button! Exploring Data Leakage Risks in Machine Learning
and Transfer Learning
- URL: http://arxiv.org/abs/2401.13796v1
- Date: Wed, 24 Jan 2024 20:30:52 GMT
- Title: Don't Push the Button! Exploring Data Leakage Risks in Machine Learning
and Transfer Learning
- Authors: Andrea Apicella, Francesco Isgrò, Roberto Prevete
- Abstract summary: This paper addresses a critical issue in Machine Learning (ML) where unintended information contaminates the training data, impacting model performance evaluation.
The discrepancy between evaluated and actual performance on new data is a significant concern.
It explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning (ML) has revolutionized various domains, offering predictive
capabilities in several areas. However, with the increasing accessibility of ML
tools, many practitioners, lacking deep ML expertise, adopt a "push the button"
approach, utilizing user-friendly interfaces without a thorough understanding
of underlying algorithms. While this approach provides convenience, it raises
concerns about the reliability of outcomes, leading to challenges such as
incorrect performance evaluation. This paper addresses a critical issue in ML,
known as data leakage, where unintended information contaminates the training
data, impacting model performance evaluation. Users, due to a lack of
understanding, may inadvertently overlook crucial steps, leading to optimistic
performance estimates that may not hold in real-world scenarios. The
discrepancy between evaluated and actual performance on new data is a
significant concern. In particular, this paper categorizes data leakage in ML,
discussing how certain conditions can propagate through the ML workflow.
Furthermore, it explores the connection between data leakage and the specific
task being addressed, investigates its occurrence in Transfer Learning, and
compares standard inductive ML with transductive ML frameworks. The conclusion
summarizes key findings, emphasizing the importance of addressing data leakage
for robust and reliable ML applications.
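A common instance of the leakage the paper describes is fitting preprocessing statistics on the full dataset before splitting, so the training pipeline indirectly "sees" the test data. The following is a minimal illustrative sketch (toy data and split chosen for this example, not taken from the paper), contrasting leaky and leak-free feature scaling:

```python
import random
import statistics

# Toy dataset: 100 samples of a single feature.
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100)]

# Train/test split: first 80 samples train, last 20 test.
train, test = data[:80], data[80:]

# LEAKY preprocessing: scaler statistics are computed on ALL data,
# so information about the test distribution contaminates training.
mu_leaky = statistics.mean(data)
sd_leaky = statistics.stdev(data)
test_scaled_leaky = [(x - mu_leaky) / sd_leaky for x in test]

# Leak-free preprocessing: statistics are fit on the training split
# only, then applied unchanged to the test split.
mu_clean = statistics.mean(train)
sd_clean = statistics.stdev(train)
test_scaled_clean = [(x - mu_clean) / sd_clean for x in test]

# The two versions of the "same" test features differ: the leaky
# version encodes test-set information, which tends to produce the
# optimistic performance estimates the paper warns about.
diff = max(abs(a - b) for a, b in zip(test_scaled_leaky, test_scaled_clean))
print(f"max difference between leaky and clean test features: {diff:.4f}")
```

The same pattern applies to any fitted transform (imputation, feature selection, dimensionality reduction): fit on the training split only, then apply to held-out data.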
Related papers
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z)
- Uncertainty Estimation by Fisher Information-based Evidential Deep Learning [61.94125052118442]
Uncertainty estimation is a key factor that makes deep learning reliable in practical applications.
We propose a novel method, Fisher Information-based Evidential Deep Learning ($\mathcal{I}$-EDL).
In particular, we introduce Fisher Information Matrix (FIM) to measure the informativeness of evidence carried by each sample, according to which we can dynamically reweight the objective loss terms to make the network more focused on the representation learning of uncertain classes.
arXiv Detail & Related papers (2023-03-03T16:12:59Z)
- Utilizing Domain Knowledge: Robust Machine Learning for Building Energy Prediction with Small, Inconsistent Datasets [1.1081836812143175]
The demand for a huge amount of data for machine learning (ML) applications is currently a bottleneck.
We propose a method to combine prior knowledge with data-driven methods to significantly reduce their data dependency.
CBML, the knowledge-encoded data-driven method, is examined in the context of energy-efficient building engineering.
arXiv Detail & Related papers (2023-01-23T08:56:11Z)
- ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline [2.0813318162800707]
We propose ezDPS, a new efficient and zero-knowledge Machine Learning inference scheme.
ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy.
We show that ezDPS is one to three orders of magnitude more efficient than the generic circuit-based approach in all metrics.
arXiv Detail & Related papers (2022-12-11T06:47:28Z)
- Rethinking Streaming Machine Learning Evaluation [9.69979862225396]
We discuss how the nature of streaming ML problems introduces new real-world challenges (e.g., delayed arrival of labels) and recommend additional metrics to assess streaming ML performance.
arXiv Detail & Related papers (2022-05-23T17:21:43Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources [78.72922528736011]
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model.
Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses.
BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
arXiv Detail & Related papers (2020-07-17T01:52:34Z)
- Insights into Performance Fitness and Error Metrics for Machine Learning [1.827510863075184]
Machine learning (ML) is the field of training machines to achieve a high level of cognition and perform human-like analysis.
This paper examines a number of the most commonly-used performance fitness and error metrics for regression and classification algorithms.
arXiv Detail & Related papers (2020-05-17T22:59:04Z)
- Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security [3.9629825964453986]
Despite the potential of machine learning to learn the behavior of malware, detect novel malware samples, and significantly improve information security, we see few, if any, high-impact ML techniques in deployed systems.
We hypothesize that the failure of ML to make a high impact in InfoSec is rooted in a disconnect between the two communities.
Specifically, current datasets and representations used by ML are not suitable for learning the behaviors of an executable.
arXiv Detail & Related papers (2020-05-04T19:19:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.