Automated Vulnerability Detection in Source Code Using Deep Representation Learning
- URL: http://arxiv.org/abs/2602.23121v1
- Date: Thu, 26 Feb 2026 15:35:17 GMT
- Title: Automated Vulnerability Detection in Source Code Using Deep Representation Learning
- Authors: C. Seas, G. Fitzpatrick, J. A. Hamilton, M. C. Carlisle
- Abstract summary: We present a convolutional neural network model that can successfully identify bugs in C code. We trained our model using two complementary datasets. We also demonstrate on a custom Linux kernel dataset that we are able to find real vulnerabilities in complex code with a low false-positive rate.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Each year, software vulnerabilities are discovered, which pose significant risks of exploitation and system compromise. We present a convolutional neural network model that can successfully identify bugs in C code. We trained our model using two complementary datasets: a machine-labeled dataset created by Draper Labs using three static analyzers and the NIST SATE Juliet human-labeled dataset designed for testing static analyzers. In contrast with the work of Russell et al. on these datasets, we focus on C programs, enabling us to specialize and optimize our detection techniques for this language. After removing duplicates from the dataset, we tokenize the input into 91 token categories. The category values are converted to a binary vector to save memory. Our first convolution layer is chosen so that the entire encoding of the token is presented to the filter. We use two convolution and pooling layers followed by two fully connected layers to classify programs into either a common weakness enumeration category or as ``clean.'' We obtain higher recall than prior work by Russell et al. on this dataset when requiring high precision. We also demonstrate on a custom Linux kernel dataset that we are able to find real vulnerabilities in complex code with a low false-positive rate.
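The abstract describes tokenizing C code into 91 token categories and converting each category value to a compact binary vector to save memory, with the first convolution filter sized to span a whole token encoding. A minimal, hypothetical sketch of that encoding step follows; the bit width (7 bits, since 2^7 = 128 ≥ 91) and the helper names are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of the token encoding described in the abstract:
# each of the 91 token categories is stored as a compact 7-bit binary
# vector rather than a 91-dimensional one-hot vector, saving memory.
ENCODING_BITS = 7  # ceil(log2(91)) bits suffice for indices 0..90

def encode_token(category: int) -> list:
    """Binary-encode a token category index as a 7-bit vector (MSB first)."""
    if not 0 <= category < 91:
        raise ValueError("category must be in [0, 91)")
    return [(category >> b) & 1 for b in range(ENCODING_BITS - 1, -1, -1)]

def encode_program(categories) -> list:
    """Flatten a sequence of token categories into one bit vector.
    A first convolution filter spanning ENCODING_BITS columns would then
    always see whole tokens, as the abstract describes."""
    return [bit for c in categories for bit in encode_token(c)]

print(encode_token(90))                   # [1, 0, 1, 1, 0, 1, 0]
print(len(encode_program([3, 17, 90])))   # 21 bits for 3 tokens
```

Downstream, the paper's architecture applies two convolution-and-pooling stages to such encoded programs, followed by two fully connected layers that classify into a CWE category or "clean."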
Related papers
- Optimizing Code Embeddings and ML Classifiers for Python Source Code Vulnerability Detection [0.0]
This study investigates the optimal combination of code embedding techniques and machine learning classifiers for vulnerability detection in Python source code. We evaluate three embedding techniques, i.e., Word2Vec, CodeBERT, and GraphCodeBERT, alongside two deep learning classifiers, Bidirectional Long Short-Term Memory (BiLSTM) networks and Convolutional Neural Networks (CNN). While CNN paired with GraphCodeBERT exhibits strong performance, the BiLSTM model using Word2Vec consistently achieves superior overall results.
arXiv Detail & Related papers (2025-09-16T14:52:02Z) - A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z) - On Training a Neural Network to Explain Binaries [43.27448128029069]
In this work, we investigate the possibility of training a deep neural network on the task of binary code understanding.
We build our own dataset derived from a capture of Stack Overflow containing 1.1M entries.
arXiv Detail & Related papers (2024-04-30T15:34:51Z) - Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z) - AnoRand: A Semi Supervised Deep Learning Anomaly Detection Method by Random Labeling [0.0]
Anomaly detection, or more generally outlier detection, is one of the most popular and challenging subjects in theoretical and applied machine learning.
We present a new semi-supervised anomaly detection method called AnoRand, which combines a deep learning architecture with random synthetic label generation.
arXiv Detail & Related papers (2023-05-28T10:53:34Z) - Semi-Supervised and Long-Tailed Object Detection with CascadeMatch [91.86787064083012]
We propose a novel pseudo-labeling-based detector called CascadeMatch.
Our detector features a cascade network architecture, which has multi-stage detection heads with progressive confidence thresholds.
We show that CascadeMatch surpasses existing state-of-the-art semi-supervised approaches in handling long-tailed object detection.
arXiv Detail & Related papers (2023-05-24T07:09:25Z) - Factorizers for Distributed Sparse Block Codes [45.29870215671697]
We propose a fast and highly accurate method for factorizing distributed sparse block codes (SBCs).
Our iterative factorizer introduces a threshold-based nonlinear activation, conditional random sampling, and an $\ell_\infty$-based similarity metric.
We demonstrate the feasibility of our method on four deep CNN architectures over CIFAR-100, ImageNet-1K, and RAVEN datasets.
arXiv Detail & Related papers (2023-03-24T12:31:48Z) - Convolutional Neural Networks for the classification of glitches in gravitational-wave data streams [52.77024349608834]
We classify transient noise signals (i.e., glitches) and gravitational waves in data from the Advanced LIGO detectors.
We use models with a supervised learning approach, trained from scratch using the Gravity Spy dataset.
We also explore a self-supervised approach, pre-training models with automatically generated pseudo-labels.
arXiv Detail & Related papers (2023-03-24T11:12:37Z) - Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on image classification across all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z) - Anomaly Detection using Capsule Networks for High-dimensional Datasets [0.0]
This study uses a capsule network for the anomaly detection task.
To the best of our knowledge, this is the first instance where a capsule network is analyzed for the anomaly detection task in a high-dimensional complex data setting.
arXiv Detail & Related papers (2021-12-27T05:07:02Z) - Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo in which handwritten mathematical terms are supposed to be automatically classified.
The input data set contains data of different writers, with label strings constructed from a total of 15 different possible characters.
arXiv Detail & Related papers (2021-09-12T19:33:34Z) - ast2vec: Utilizing Recursive Neural Encodings of Python Programs [3.088385631471295]
We present ast2vec, a neural network that maps Python syntax trees to vectors and back.
Ast2vec has been trained on almost half a million programs of novice programmers.
arXiv Detail & Related papers (2021-03-22T06:53:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.