LeakageDetector 2.0: Analyzing Data Leakage in Jupyter-Driven Machine Learning Pipelines
- URL: http://arxiv.org/abs/2509.15971v1
- Date: Fri, 19 Sep 2025 13:27:27 GMT
- Title: LeakageDetector 2.0: Analyzing Data Leakage in Jupyter-Driven Machine Learning Pipelines
- Authors: Owen Truong, Terrence Zhang, Arnav Marchareddy, Ryan Lee, Jeffery Busold, Michael Socas, Eman Abdullah AlOmar,
- Abstract summary: This study aims to assist Machine Learning (ML) engineers in enhancing their code by identifying and correcting Data Leakage issues within their models.<n>Data Leakage occurs when information from the test dataset is inadvertently included in the training data when preparing a data science model.
- Score: 2.61396737784983
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In software development environments, code quality is crucial. This study aims to assist Machine Learning (ML) engineers in enhancing their code by identifying and correcting Data Leakage issues within their models. Data Leakage occurs when information from the test dataset is inadvertently included in the training data when preparing a data science model, resulting in misleading performance evaluations. ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code. In this paper, we develop a new Visual Studio Code (VS Code) extension, called LeakageDetector, that detects Data Leakage, mainly Overlap, Preprocessing and Multi-test leakage, from Jupyter Notebook files. Beyond detection, we included two correction mechanisms: a conventional approach, known as a quick fix, which manually fixes the leakage, and an LLM-driven approach that guides ML developers toward best practices for building ML pipelines.
Related papers
- LeakageDetector: An Open Source Data Leakage Analysis Tool in Machine Learning Pipelines [3.5453450990441238]
Our work seeks to enable Machine Learning (ML) engineers to write better code by helping them find and fix instances of Data Leakage in their models.<n> ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code.<n>In this paper, we develop LEAKAGEDETECTOR, a Python plugin that identifies instances of Data Leakage in ML code and provides suggestions on how to remove the leakage.
arXiv Detail & Related papers (2025-03-18T20:53:44Z) - Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes.<n>Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z) - Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets [19.844836459291546]
High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models.<n>However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources.<n>In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning.
arXiv Detail & Related papers (2025-03-09T15:29:46Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories.<n> evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation.<n>We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.<n>Our method is able to work under gray-box conditions without access to model training data or weights.<n>We evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish [47.3916421056009]
Large Language Models (LLMs) are trained on massive web-crawled corpora.
LLMs produce leaked information in most cases despite less such data in their training set.
Self-detection method showed superior performance compared to existing detection methods.
arXiv Detail & Related papers (2024-03-24T13:21:58Z) - Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models [1.443696537295348]
Privacy leakage and copyright violation are still underexplored.
Our unlearning algorithms are not only data-agnostic/model-agnostic but also proven to be robust in terms of utility preservation or privacy guarantee.
arXiv Detail & Related papers (2024-03-13T18:57:30Z) - Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.<n>We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.<n>Our findings show that instruction-tuned models can expose pre-training data as much as their base-models, if not more so, and using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z) - Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning [0.0]
This paper addresses a critical issue in Machine Learning (ML) where unintended information contaminates the training data, impacting model performance evaluation.<n>The discrepancy between evaluated and actual performance on new data is a significant concern.<n>It explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks.
arXiv Detail & Related papers (2024-01-24T20:30:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.