From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models
- URL: http://arxiv.org/abs/2601.22946v1
- Date: Fri, 30 Jan 2026 13:05:52 GMT
- Title: From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models
- Authors: Farnaz Soltaniani, Mohammad Ghafari
- Abstract summary: Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize.
- Score: 1.7188280334580195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard-coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.
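To make the failure mode concrete, here is a minimal sketch, assuming samples are code-snippet strings, of how one might measure exact and near-duplicate overlap between train and test splits. The normalization, hashing, and similarity threshold are illustrative choices, not the authors' method.

```python
import hashlib

def normalize(snippet: str) -> str:
    """Crude normalization so trivially reformatted duplicates collide:
    lowercase and collapse all whitespace."""
    return " ".join(snippet.lower().split())

def exact_overlap(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose normalized form also appears in
    the training split (exact-match leakage)."""
    seen = {hashlib.sha256(normalize(s).encode()).hexdigest() for s in train}
    hits = sum(hashlib.sha256(normalize(s).encode()).hexdigest() in seen
               for s in test)
    return hits / len(test) if test else 0.0

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two snippets."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def near_duplicate_overlap(train: list[str], test: list[str],
                           threshold: float = 0.9) -> float:
    """Fraction of test samples highly similar to some training sample.
    Quadratic in dataset size; at real benchmark scale one would use
    MinHash/LSH instead."""
    hits = sum(any(jaccard(t, s) >= threshold for s in train) for t in test)
    return hits / len(test) if test else 0.0
```

A nonzero overlap by either measure means reported test performance partly reflects memorization of training samples rather than generalization.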
Related papers
- Memorization or Interpolation? Detecting LLM Memorization through Input Perturbation Analysis [8.725781605542675]
Large Language Models (LLMs) achieve remarkable performance through training on massive datasets. LLMs can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This paper introduces PEARL, a novel approach for detecting memorization in LLMs (a rough sketch of the perturbation idea follows this entry).
arXiv Detail & Related papers (2025-05-05T20:42:34Z)
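As a hypothetical illustration of perturbation-based memorization checks, not PEARL's actual algorithm: if a model's output collapses under tiny input perturbations, verbatim recall is a plausible explanation. Here `generate` is a stand-in for any prompt-to-text model call.

```python
# Hypothetical sketch, in the spirit of input-perturbation analysis.
import difflib
import random

def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop a small fraction of characters from the prompt."""
    rng = random.Random(seed)
    return "".join(c for c in prompt if rng.random() > rate)

def similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

def memorization_score(generate, prompt: str, n: int = 5) -> float:
    """High sensitivity of the output to tiny input perturbations can
    indicate verbatim recall rather than generalization."""
    base = generate(prompt)
    drops = [1.0 - similarity(base, generate(perturb(prompt, seed=i)))
             for i in range(n)]
    return sum(drops) / n  # closer to 1.0 = more perturbation-sensitive
```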
- Two Is Better Than One: Aligned Representation Pairs for Anomaly Detection [56.57122939745213]
Anomaly detection focuses on identifying samples that deviate from the norm. Recent self-supervised methods have successfully learned such representations by employing prior knowledge about anomalies to create synthetic outliers during training. We address the reliance on such anomaly-specific priors with our new approach Con$_2$, which leverages prior knowledge about symmetries in normal samples to observe the data in different contexts.
arXiv Detail & Related papers (2024-05-29T07:59:06Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to perform well only on similar data, while underperforming on real-world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models.
LDSS involves injecting a small volume of synthetic data, characterized by local shifts in class distribution, into the owner's dataset.
This enables the effective identification of models trained on leaked data through model querying alone (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-06T10:36:28Z)
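A loose sketch of the injection-then-query idea summarized above; the region choice, noise scale, and decision rule are guesses for illustration, not the paper's exact procedure.

```python
# Illustrative sketch: inject synthetic, locally label-shifted points,
# then probe a suspect model on exactly those points.
import numpy as np

def inject_shifted_points(X, y, n_inject=50, target_class=1, seed=0):
    """Create synthetic points near a random region of X and label them
    with `target_class`, locally shifting the class distribution."""
    rng = np.random.default_rng(seed)
    center = X[rng.integers(len(X))]
    X_syn = center + 0.01 * rng.standard_normal((n_inject, X.shape[1]))
    y_syn = np.full(n_inject, target_class)
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn]), X_syn

def leakage_evidence(predict, X_syn, target_class=1):
    """Fraction of injected points a suspect model assigns to the
    shifted label; a value near 1.0 suggests the model was trained
    on the leaked dataset."""
    return float(np.mean(predict(X_syn) == target_class))
```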
- On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework that induces different responses to UAPs between normal and adversarial samples.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z)
- Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to identify memorized atypical samples, and then finetunes the model or prunes it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z)
- Robust Task-Oriented Dialogue Generation with Contrastive Pre-training and Adversarial Filtering [17.7709632238066]
Data artifacts incentivize machine learning models to learn non-transferable generalizations.
We investigate whether popular datasets such as MultiWOZ contain such data artifacts.
We propose a contrastive learning based framework to encourage the model to ignore these cues and focus on learning generalisable patterns.
arXiv Detail & Related papers (2022-05-20T03:13:02Z)
- DAE: Discriminatory Auto-Encoder for multivariate time-series anomaly detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE).
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase (a rough architectural sketch follows this entry).
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z)
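A rough PyTorch sketch of the architecture described above: one shared LSTM encoder with a decoder per flight phase. Layer sizes and the reconstruction setup are assumptions, not the paper's configuration.

```python
# Illustrative multi-decoder auto-encoder; not the authors' exact model.
import torch
import torch.nn as nn

class DiscriminatoryAutoEncoder(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32, n_phases: int = 3):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        # One decoder per flight phase (e.g., climb, cruise, descent).
        self.decoders = nn.ModuleList(
            nn.LSTM(hidden, n_features, batch_first=True)
            for _ in range(n_phases)
        )

    def forward(self, x: torch.Tensor, phase: int) -> torch.Tensor:
        z, _ = self.encoder(x)              # (batch, seq, hidden)
        recon, _ = self.decoders[phase](z)  # route by flight phase
        return recon

# Anomaly score = reconstruction error under the phase-specific decoder:
# model = DiscriminatoryAutoEncoder(n_features=8)
# err = torch.mean((model(batch, phase=0) - batch) ** 2, dim=(1, 2))
```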
- Detecting Faults during Automatic Screwdriving: A Dataset and Use Case of Anomaly Detection for Automatic Screwdriving [80.6725125503521]
Data-driven approaches using Machine Learning (ML) to detect faults have recently gained increasing interest.
We present a use case of ML models for detecting faults during automated screwdriving operations.
arXiv Detail & Related papers (2021-07-05T11:46:00Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these elements can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- Data Impressions: Mining Deep Models to Extract Samples for Data-free Applications [26.48630545028405]
"Data Impressions" act as proxy to the training data and can be used to realize a variety of tasks.
We show the applicability of data impressions in solving several computer vision tasks.
arXiv Detail & Related papers (2021-01-15T11:37:29Z)
- Trade-offs between membership privacy & adversarially robust learning [13.37805637358556]
We identify settings where standard models will overfit to a larger extent in comparison to robust models.
The degree of overfitting naturally depends on the amount of data available for training.
arXiv Detail & Related papers (2020-06-08T14:20:12Z)