Don't Push the Button! Exploring Data Leakage Risks in Machine Learning
and Transfer Learning
- URL: http://arxiv.org/abs/2401.13796v1
- Date: Wed, 24 Jan 2024 20:30:52 GMT
- Title: Don't Push the Button! Exploring Data Leakage Risks in Machine Learning
and Transfer Learning
- Authors: Andrea Apicella, Francesco Isgrò, Roberto Prevete
- Abstract summary: This paper addresses a critical issue in Machine Learning (ML) where unintended information contaminates the training data, impacting model performance evaluation.
The discrepancy between evaluated and actual performance on new data is a significant concern.
It explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning (ML) has revolutionized various domains, offering predictive
capabilities in several areas. However, with the increasing accessibility of ML
tools, many practitioners, lacking deep ML expertise, adopt a "push the button"
approach, utilizing user-friendly interfaces without a thorough understanding
of underlying algorithms. While this approach provides convenience, it raises
concerns about the reliability of outcomes, leading to challenges such as
incorrect performance evaluation. This paper addresses a critical issue in ML,
known as data leakage, where unintended information contaminates the training
data, impacting model performance evaluation. Users, due to a lack of
understanding, may inadvertently overlook crucial steps, leading to optimistic
performance estimates that may not hold in real-world scenarios. The
discrepancy between evaluated and actual performance on new data is a
significant concern. In particular, this paper categorizes data leakage in ML,
discussing how certain conditions can propagate through the ML workflow.
Furthermore, it explores the connection between data leakage and the specific
task being addressed, investigates its occurrence in Transfer Learning, and
compares standard inductive ML with transductive ML frameworks. The conclusion
summarizes key findings, emphasizing the importance of addressing data leakage
for robust and reliable ML applications.
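A common instance of the leakage the paper describes is fitting preprocessing statistics on the full dataset before splitting, so the training pipeline indirectly "sees" the test data. The following is a minimal illustrative sketch (toy data and split chosen for this example, not taken from the paper), contrasting leaky and leak-free feature scaling:

```python
import random
import statistics

# Toy dataset: 100 samples of a single feature.
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100)]

# Train/test split: first 80 samples train, last 20 test.
train, test = data[:80], data[80:]

# LEAKY preprocessing: scaler statistics are computed on ALL data,
# so information about the test distribution contaminates training.
mu_leaky = statistics.mean(data)
sd_leaky = statistics.stdev(data)
test_scaled_leaky = [(x - mu_leaky) / sd_leaky for x in test]

# Leak-free preprocessing: statistics are fit on the training split
# only, then applied unchanged to the test split.
mu_clean = statistics.mean(train)
sd_clean = statistics.stdev(train)
test_scaled_clean = [(x - mu_clean) / sd_clean for x in test]

# The two versions of the "same" test features differ: the leaky
# version encodes test-set information, which tends to produce the
# optimistic performance estimates the paper warns about.
diff = max(abs(a - b) for a, b in zip(test_scaled_leaky, test_scaled_clean))
print(f"max difference between leaky and clean test features: {diff:.4f}")
```

The same pattern applies to any fitted transform (imputation, feature selection, dimensionality reduction): fit on the training split only, then apply to held-out data.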
Related papers
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z)
- Uncertainty Estimation by Fisher Information-based Evidential Deep Learning [61.94125052118442]
Uncertainty estimation is a key factor that makes deep learning reliable in practical applications.
We propose a novel method, Fisher Information-based Evidential Deep Learning ($\mathcal{I}$-EDL).
In particular, we introduce Fisher Information Matrix (FIM) to measure the informativeness of evidence carried by each sample, according to which we can dynamically reweight the objective loss terms to make the network more focused on the representation learning of uncertain classes.
arXiv Detail & Related papers (2023-03-03T16:12:59Z)
- Utilizing Domain Knowledge: Robust Machine Learning for Building Energy Prediction with Small, Inconsistent Datasets [1.1081836812143175]
The demand for a huge amount of data for machine learning (ML) applications is currently a bottleneck.
We propose a method to combine prior knowledge with data-driven methods to significantly reduce their data dependency.
CBML, the knowledge-encoded data-driven method, is examined in the context of energy-efficient building engineering.
arXiv Detail & Related papers (2023-01-23T08:56:11Z)
- ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline [2.0813318162800707]
We propose ezDPS, a new efficient and zero-knowledge Machine Learning inference scheme.
ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy.
We show that ezDPS is one to three orders of magnitude more efficient than the generic circuit-based approach in all metrics.
arXiv Detail & Related papers (2022-12-11T06:47:28Z)
- Rethinking Streaming Machine Learning Evaluation [9.69979862225396]
We discuss how the nature of streaming ML problems introduces new real-world challenges (e.g., delayed arrival of labels) and recommend additional metrics to assess streaming ML performance.
arXiv Detail & Related papers (2022-05-23T17:21:43Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources [78.72922528736011]
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model.
Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses.
BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
arXiv Detail & Related papers (2020-07-17T01:52:34Z)
- Insights into Performance Fitness and Error Metrics for Machine Learning [1.827510863075184]
Machine learning (ML) is the field of training machines to achieve a high level of cognition and perform human-like analysis.
This paper examines a number of the most commonly-used performance fitness and error metrics for regression and classification algorithms.
arXiv Detail & Related papers (2020-05-17T22:59:04Z)
- Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security [3.9629825964453986]
Despite the potential of machine learning to learn the behavior of malware, detect novel malware samples, and significantly improve information security, we see few, if any, high-impact ML techniques in deployed systems.
We hypothesize that the failure of ML to make a high impact in InfoSec is rooted in a disconnect between the two communities.
Specifically, current datasets and representations used by ML are not suitable for learning the behaviors of an executable.
arXiv Detail & Related papers (2020-05-04T19:19:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.