Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code
- URL: http://arxiv.org/abs/2412.02048v2
- Date: Fri, 07 Feb 2025 01:42:30 GMT
- Title: Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code
- Authors: Gary A. McCully, John D. Hastings, Shengjie Xu
- Abstract summary: The research specifically focuses on how model performance is affected when embedding models are trained with datasets that include samples later used for neural network training and validation.
The results show that introducing data snooping did not significantly alter model performance.
In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings.
- Score: 5.4141465747474475
- Abstract: This study examines the impact of data snooping on neural networks used to detect vulnerabilities in lifted code, and builds on previous research that used word2vec and unidirectional and bidirectional transformer-based embeddings. The research specifically focuses on how model performance is affected when embedding models are trained with datasets that include samples used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings. The fact that this holds even when data snooping is introduced into the embedding model indicates GPT-2's robustness in representing complex code features, even under less-than-ideal conditions.
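To make the setup concrete, below is a minimal toy sketch (not the authors' pipeline) of the two protocols the abstract contrasts: in the snooped protocol the embedding model is trained on a corpus that includes the classifier's validation samples, while in the clean protocol the embedding corpus is disjoint from them. Token sequences stand in for lifted code, gensim's word2vec for the embedding stage, and logistic regression for the downstream detector; labels here are random, so the sketch illustrates only the protocol, not any performance effect.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for lifted-code samples and vulnerability labels.
rng = np.random.default_rng(0)
corpus = [[f"op{rng.integers(0, 50)}" for _ in range(20)] for _ in range(400)]
labels = rng.integers(0, 2, size=400)

def featurize(w2v, tokens):
    # Mean-pool word vectors; fall back to zeros for out-of-vocab samples.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# --- Snooped protocol: the embedding model sees *every* sample, including
# those later held out to validate the classifier. ---
w2v_snooped = Word2Vec(corpus, vector_size=32, min_count=1, seed=0)
X = np.stack([featurize(w2v_snooped, s) for s in corpus])
X_tr, X_va, y_tr, y_va = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("snooped val acc:", clf.score(X_va, y_va))

# --- Clean protocol: the embedding corpus is disjoint from every sample the
# classifier trains or validates on. ---
emb_idx, clf_idx = train_test_split(np.arange(400), test_size=0.5, random_state=0)
w2v_clean = Word2Vec([corpus[i] for i in emb_idx], vector_size=32, min_count=1, seed=0)
Xc = np.stack([featurize(w2v_clean, corpus[i]) for i in clf_idx])
X_tr, X_va, y_tr, y_va = train_test_split(Xc, labels[clf_idx], test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("clean   val acc:", clf.score(X_va, y_va))
```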
Related papers
- PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning [49.60634126342945]
Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes.
Recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information.
We employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues.
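As a rough illustration of this contrastive component (a generic supervised-contrastive sketch, not PairCFR's actual objective), embeddings sharing a label are treated as positives and pulled together, while all other pairs are pushed apart:

```python
import torch
import torch.nn.functional as F

# Generic supervised-contrastive (InfoNCE-style) loss: an illustrative
# sketch of the idea, not PairCFR's exact formulation.
def sup_con_loss(emb, labels, temperature=0.1):
    emb = F.normalize(emb, dim=1)                      # (N, d) unit vectors
    sim = emb @ emb.t() / temperature                  # (N, N) scaled cosine sims
    eye = torch.eye(len(emb), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Maximize log-probability of same-label pairs.
    return -log_prob[pos].mean()

emb = torch.randn(8, 16, requires_grad=True)
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
loss = sup_con_loss(emb, labels)
loss.backward()
print(loss.item())
```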
arXiv Detail & Related papers (2024-06-09T07:29:55Z)
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Towards a robust and reliable deep learning approach for detection of compact binary mergers in gravitational wave data [0.0]
We develop a deep learning model stage-wise and work towards improving its robustness and reliability.
We retrain the model in a novel framework involving a generative adversarial network (GAN).
Although absolute robustness is practically impossible to achieve, we demonstrate some fundamental improvements earned through such training.
arXiv Detail & Related papers (2023-06-20T18:00:05Z)
- Phantom Embeddings: Using Embedding Space for Model Regularization in Deep Neural Networks [12.293294756969477]
The strength of machine learning models stems from their ability to learn complex function approximations from data.
The complex models tend to memorize the training data, which results in poor regularization performance on test data.
We present a novel approach to regularize the models by leveraging the information-rich latent embeddings and their high intra-class correlation.
arXiv Detail & Related papers (2023-04-14T17:15:54Z)
- Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
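A heavily condensed sketch of this two-module composition (my simplification; the dimensions, mean aggregation, and names are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Feature embeddings are rows of a table; an instance's representation
# aggregates the embeddings of its observed features (feature-data graph),
# and a new feature's embedding is extrapolated by passing messages back
# from the instances in which it occurs.
d = 16
feat_emb = nn.Parameter(torch.randn(10, d))      # embeddings for 10 known features
backbone = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))

# incidence[i, j] = 1 if instance i exhibits feature j.
incidence = (torch.rand(32, 10) > 0.7).float()
inst_repr = incidence @ feat_emb / incidence.sum(1, keepdim=True).clamp(min=1)
logits = backbone(inst_repr)                     # lower model: predict labels

# An unseen 11th feature appears at test time in some instances; extrapolate
# its embedding as the mean representation of those instances.
new_col = (torch.rand(32, 1) > 0.8).float()
new_feat_emb = (new_col.t() @ inst_repr) / new_col.sum().clamp(min=1)
print(logits.shape, new_feat_emb.shape)          # (32, 2) and (1, 16)
```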
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
- Similarity Embedding Networks for Robust Human Activity Recognition [19.162857787656247]
We design a similarity embedding neural network that maps input sensor signals onto real vectors through carefully designed convolutional and LSTM layers.
The embedding network is trained with a pairwise similarity loss, encouraging the clustering of samples from the same class in the embedded real space.
Extensive evaluation based on two public datasets has shown that the proposed similarity embedding network significantly outperforms state-of-the-art deep models on HAR classification tasks.
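The pairwise similarity loss can be sketched with the classic margin-based contrastive formulation (a generic version; the paper's exact loss may differ):

```python
import torch
import torch.nn.functional as F

# Margin-based pairwise loss: same-class embedding pairs are pulled
# together, different-class pairs pushed beyond a margin.
def pairwise_similarity_loss(z1, z2, same_class, margin=1.0):
    dist = F.pairwise_distance(z1, z2)
    pos = same_class * dist.pow(2)
    neg = (1 - same_class) * (margin - dist).clamp(min=0).pow(2)
    return (pos + neg).mean()

z1, z2 = torch.randn(4, 8), torch.randn(4, 8)   # embedded sensor windows
same_class = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = pair from same activity
print(pairwise_similarity_loss(z1, z2, same_class).item())
```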
arXiv Detail & Related papers (2021-05-31T11:52:32Z)
- MLDS: A Dataset for Weight-Space Analysis of Neural Networks [0.0]
We present MLDS, a new dataset consisting of thousands of trained neural networks with carefully controlled parameters.
This dataset enables new insights into both model-to-model and model-to-training-data relationships.
arXiv Detail & Related papers (2021-04-21T14:24:26Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
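One way to read this objective (an illustrative sketch, not the paper's implementation): negatively augmented real images, here a crude four-tile jigsaw, are treated as additional fakes by the discriminator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def jigsaw(x):
    # Swap image quadrants: destroys global structure while keeping local
    # statistics, yielding samples just outside the data's support.
    top, bottom = x.chunk(2, dim=2)
    x = torch.cat([bottom, top], dim=2)
    left, right = x.chunk(2, dim=3)
    return torch.cat([right, left], dim=3)

def discriminator_loss(disc, real, fake, lam=0.5):
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)
    loss = F.binary_cross_entropy_with_logits(disc(real), ones)
    loss += F.binary_cross_entropy_with_logits(disc(fake), zeros)
    # NDA term: the negatively augmented reals must also be rejected.
    loss += lam * F.binary_cross_entropy_with_logits(disc(jigsaw(real)), zeros)
    return loss

disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))
real, fake = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
print(discriminator_loss(disc, real, fake).item())
```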
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
- Statistical model-based evaluation of neural networks [74.10854783437351]
We develop an experimental setup for the evaluation of neural networks (NNs).
The setup helps to benchmark a set of NNs vis-a-vis minimum-mean-square-error (MMSE) performance bounds.
This allows us to test the effects of training data size, data dimension, data geometry, noise, and mismatch between training and testing conditions.
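For a linear-Gaussian model this kind of benchmark has a closed-form reference point, so a toy version (my construction, not the paper's setup) can compare an NN's test MSE against the analytic MMSE:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Estimate x from y = x + n with Gaussian x and n; the MMSE estimator
# E[x|y] = sx2/(sx2+sn2) * y and its error are known in closed form.
rng = np.random.default_rng(0)
sx2, sn2, d, n = 1.0, 0.5, 4, 20000
x = rng.normal(0, np.sqrt(sx2), (n, d))
y = x + rng.normal(0, np.sqrt(sn2), (n, d))

nn_model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
nn_model.fit(y[:16000], x[:16000])
nn_mse = np.mean((nn_model.predict(y[16000:]) - x[16000:]) ** 2)

mmse = sx2 * sn2 / (sx2 + sn2)  # per-component MMSE bound
print(f"NN test MSE: {nn_mse:.4f}  vs  MMSE bound: {mmse:.4f}")
```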
arXiv Detail & Related papers (2020-11-18T00:33:24Z)