Explaining the Contributing Factors for Vulnerability Detection in Machine Learning
- URL: http://arxiv.org/abs/2406.03577v1
- Date: Wed, 5 Jun 2024 18:48:00 GMT
- Title: Explaining the Contributing Factors for Vulnerability Detection in Machine Learning
- Authors: Esma Mouine, Yan Liu, Lu Xiao, Rick Kazman, Xiao Wang
- Abstract summary: There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to automatically detect software vulnerabilities.
This study investigates how the combination of different vulnerability features and three representative machine learning models impact the accuracy of vulnerability detection in 17 real-world projects.
- Score: 16.14514846773874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to automatically detect software vulnerabilities. A fundamental but unresolved research question is: how do different factors in the mining and learning process impact the accuracy of identifying vulnerabilities in software projects of varying characteristics? Substantial research has been dedicated to this area, including source code static analysis, software repository mining, and NLP-based machine learning. However, practitioners lack experience regarding the key factors for building a state-of-the-art baseline model, as well as regarding the transferability of vulnerability signatures from project to project. This study investigates how the combination of different vulnerability features and three representative machine learning models impacts the accuracy of vulnerability detection in 17 real-world projects. We examine two types of vulnerability representations: 1) code features extracted through NLP with varying tokenization strategies and three different embedding techniques (bag-of-words, word2vec, and fastText) and 2) a set of eight architectural metrics that capture the abstract design of the software systems. The three machine learning algorithms are a random forest model, a support vector machine model, and a residual neural network model. The analysis shows that a recommended baseline model, with signatures extracted through bag-of-words embedding combined with a random forest, consistently increases the detection accuracy by about 4% compared to other combinations across all 17 projects. Furthermore, our experiments show the limitations of transferring vulnerability signatures across domains.
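As a rough illustration of the recommended baseline combination (bag-of-words signatures fed to a random forest), the sketch below uses scikit-learn; the toy snippets, labels, and hyperparameters are illustrative assumptions, not the paper's actual data or configuration.

```python
# Hypothetical sketch of the recommended baseline: bag-of-words features
# over tokenized code, fed to a random forest classifier. The snippets
# and labels below are invented stand-ins for mined functions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

snippets = [
    "strcpy ( buf , input )",                       # unbounded copy (vulnerable)
    "memcpy ( dst , src , len )",                   # unchecked length (vulnerable)
    "strncpy ( buf , input , sizeof ( buf ) )",     # bounded copy (clean)
    "snprintf ( buf , sizeof ( buf ) , fmt )",      # bounded format (clean)
]
labels = [1, 1, 0, 0]

# Bag-of-words embedding: each snippet becomes a sparse token-count vector.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(snippets)

# Random forest over the count vectors, as in the recommended combination.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

pred = clf.predict(vectorizer.transform(["strcpy ( dest , user_data )"]))
print(pred[0])
```

In the paper's setting, the documents would be tokenized functions mined from the 17 projects rather than hand-written strings, with tokenization strategy and embedding choice varied as described above.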
Related papers
- Deep Learning-based Binary Analysis for Vulnerability Detection in x86-64 Machine Code [0.0]
This paper explores the feasibility of extracting features directly from raw x86-64 machine code. It shows that machine code may enable more efficient, lightweight models and preserve information that would otherwise be lost in disassembly.
arXiv Detail & Related papers (2026-01-14T04:52:47Z) - Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation [69.8237598448941]
This study investigates the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection. We propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
arXiv Detail & Related papers (2025-09-16T03:48:22Z) - When Machine Learning Meets Vulnerability Discovery: Challenges and Lessons Learned [3.000275719116454]
In this paper, we explore the challenges of applying machine learning to vulnerability discovery. First, researchers often fail to provide concrete statistics about their training datasets. Second, the choice of model and the level of granularity at which models are trained also affect the effectiveness of such vulnerability discovery approaches.
arXiv Detail & Related papers (2025-08-20T20:09:49Z) - Deep Learning Models for Robust Facial Liveness Detection [56.08694048252482]
This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques. By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision.
arXiv Detail & Related papers (2025-08-12T17:19:20Z) - Enhancing Software Vulnerability Detection Using Code Property Graphs and Convolutional Neural Networks [0.0]
This paper proposes a novel approach to detecting software vulnerabilities using a combination of code property graphs and machine learning techniques.
We introduce various neural network models, including convolutional neural networks adapted for graph data, to process these representations.
Our contributions include a methodology for transforming software code into code property graphs, the implementation of a convolutional neural network model for graph data, and the creation of a comprehensive dataset for training and evaluation.
arXiv Detail & Related papers (2025-03-23T19:12:07Z) - Deep Learning Aided Software Vulnerability Detection: A Survey [3.4396557936415686]
The pervasive nature of software vulnerabilities has emerged as a primary factor for the surge in cyberattacks.
Deep learning (DL) methods excel at automatically learning and identifying complex patterns in code.
This survey analyzes 34 relevant studies from high-impact journals and conferences between 2017 and 2024.
arXiv Detail & Related papers (2025-03-06T01:35:16Z) - Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
Computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI.
We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts.
We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z) - LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights [12.424610893030353]
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection.
This paper provides a detailed survey of LLMs in vulnerability detection.
We address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis.
arXiv Detail & Related papers (2025-02-10T21:33:38Z) - A Combined Feature Embedding Tools for Multi-Class Software Defect and Identification [2.2020053359163305]
We present CodeGraphNet, an experimental method that combines GraphCodeBERT and Graph Convolutional Network approaches.
This method captures intricate relationships between features, enabling more exact identification and separation of vulnerabilities.
The DeepTree model, which is a hybrid of a Decision Tree and a Neural Network, outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-26T17:33:02Z) - Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection [9.652886240532741]
This paper thoroughly analyses large language models' capabilities in detecting vulnerabilities within source code.
We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs.
arXiv Detail & Related papers (2024-08-29T10:00:57Z) - Concrete Surface Crack Detection with Convolutional-based Deep Learning Models [0.0]
Crack detection is pivotal for structural health monitoring and inspection of buildings.
Convolutional neural networks (CNNs) have emerged as a promising framework for crack detection.
We employ fine-tuning techniques on pre-trained deep learning architectures.
arXiv Detail & Related papers (2024-01-13T17:31:12Z) - Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z) - Deep-Learning-based Vulnerability Detection in Binary Executables [0.0]
We present a supervised deep learning approach using recurrent neural networks for vulnerability detection in binary executables.
A dataset with 50,651 samples of vulnerable code in the form of a standardized LLVM Intermediate Representation is used.
A binary classification was established for detecting the presence of an arbitrary vulnerability, and a multi-class model was trained for the identification of the exact vulnerability.
arXiv Detail & Related papers (2022-11-25T10:33:33Z) - Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z) - Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because parsed code naturally admits a graph structure, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
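To make the graph-over-code idea concrete, here is a single graph-convolution propagation step over a toy statement graph; the adjacency matrix, feature sizes, and GCN-style update rule are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative one-layer graph-convolution step over a toy "code graph":
# nodes are statements, edges are control/data-flow links. Shapes and
# values are invented; this is not the paper's GNN.
import numpy as np

# Adjacency for a 4-node graph, made symmetric with self-loops added,
# as in a standard GCN layer.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))

H = np.random.default_rng(0).normal(size=(4, 8))  # initial node features
W = np.random.default_rng(1).normal(size=(8, 4))  # learnable weights

# Propagation rule H' = ReLU(D^-1/2 (A + I) D^-1/2 H W): each node's new
# feature vector mixes its neighbours' features through the shared weights.
H_next = np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
print(H_next.shape)
```

Stacking several such layers lets information flow along longer control/data-flow paths before a final classifier reads out a vulnerability label per node or per program.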
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z) - An Explainable Machine Learning-based Network Intrusion Detection System for Enabling Generalisability in Securing IoT Networks [0.0]
Machine Learning (ML)-based network intrusion detection systems bring many benefits for enhancing the security posture of an organisation.
Many systems have been designed and developed in the research community, often achieving a perfect detection rate when evaluated using certain datasets.
This paper tightens the gap by evaluating the generalisability of a common feature set to different network environments and attack types.
arXiv Detail & Related papers (2021-04-15T00:44:45Z) - Increasing the Confidence of Deep Neural Networks by Coverage Analysis [71.57324258813674]
This paper presents a lightweight monitoring architecture based on coverage paradigms to harden the model against different unsafe inputs.
Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs.
arXiv Detail & Related papers (2021-01-28T16:38:26Z) - Dos and Don'ts of Machine Learning in Computer Security [74.1816306998445]
Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance.
We identify common pitfalls in the design, implementation, and evaluation of learning-based security systems.
We propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible.
arXiv Detail & Related papers (2020-10-19T13:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.