On using distributed representations of source code for the detection of
C security vulnerabilities
- URL: http://arxiv.org/abs/2106.01367v1
- Date: Tue, 1 Jun 2021 21:18:23 GMT
- Title: On using distributed representations of source code for the detection of
C security vulnerabilities
- Authors: David Coimbra, Sofia Reis, Rui Abreu, Corina Păsăreanu, Hakan Erdogmus
- Abstract summary: This paper presents an evaluation of the code representation model Code2vec when trained on the task of detecting security vulnerabilities in C source code.
We leverage the open-source library astminer to extract path-contexts from the abstract syntax trees of a corpus of labeled C functions.
Code2vec is trained on the resulting path-contexts with the task of classifying a function as vulnerable or non-vulnerable.
- Score: 14.8831988481175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an evaluation of the code representation model Code2vec
when trained on the task of detecting security vulnerabilities in C source
code. We leverage the open-source library astminer to extract path-contexts
from the abstract syntax trees of a corpus of labeled C functions. Code2vec is
trained on the resulting path-contexts with the task of classifying a function
as vulnerable or non-vulnerable. Using the CodeXGLUE benchmark, we show that
the accuracy of Code2vec for this task is comparable to simple
transformer-based methods such as pre-trained RoBERTa, and outperforms more
naive NLP-based methods. We achieved an accuracy of 61.43% while maintaining
low computational requirements relative to larger models.
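To make the described pipeline concrete, below is a minimal, illustrative PyTorch sketch of a code2vec-style binary classifier over path-contexts of the kind astminer exports from C abstract syntax trees. This is not the authors' implementation: the vocabulary sizes, embedding dimension, and the randomly generated batch are hypothetical placeholders, and the paper itself trains the original code2vec model on astminer output.

```python
# Illustrative code2vec-style classifier over path-contexts
# (start token, AST path, end token), e.g. as extracted by astminer.
# All sizes and the dummy batch below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Code2VecClassifier(nn.Module):
    def __init__(self, n_tokens, n_paths, dim=128):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, dim, padding_idx=0)
        self.path_emb = nn.Embedding(n_paths, dim, padding_idx=0)
        self.combine = nn.Linear(3 * dim, dim, bias=False)   # combines each context triple
        self.attention = nn.Parameter(torch.randn(dim))      # global attention vector
        self.out = nn.Linear(dim, 2)                         # vulnerable vs. non-vulnerable

    def forward(self, starts, paths, ends, mask):
        # starts, paths, ends: (batch, max_contexts) integer ids; mask marks real contexts
        ctx = torch.cat([self.token_emb(starts),
                         self.path_emb(paths),
                         self.token_emb(ends)], dim=-1)      # (B, C, 3*dim)
        ctx = torch.tanh(self.combine(ctx))                  # combined context vectors
        scores = ctx @ self.attention                        # (B, C) attention logits
        scores = scores.masked_fill(mask == 0, float("-inf"))
        alpha = F.softmax(scores, dim=-1).unsqueeze(-1)      # attention weights
        code_vec = (alpha * ctx).sum(dim=1)                  # (B, dim) code vector
        return self.out(code_vec)                            # class logits

# Hypothetical usage: a batch of 4 functions, up to 200 path-contexts each.
model = Code2VecClassifier(n_tokens=50_000, n_paths=100_000)
starts = torch.randint(1, 50_000, (4, 200))
paths = torch.randint(1, 100_000, (4, 200))
ends = torch.randint(1, 50_000, (4, 200))
mask = torch.ones(4, 200)
logits = model(starts, paths, ends, mask)                   # shape (4, 2)
loss = F.cross_entropy(logits, torch.tensor([0, 1, 0, 1]))
```

Each function is represented as a bag of (start token, path, end token) triples; the attention vector weights the combined context vectors into a single code vector, which a final linear layer maps to the two classes (vulnerable / non-vulnerable).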
Related papers
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that LLM-generated code often contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such as CodeBERT are easy to fine-tune, but it is often difficult for them to learn vulnerability semantics from complex code.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z)
- Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in Lifted Compiled Code [4.956066467858057]
This research explores vulnerability detection using natural language processing (NLP) embedding techniques with word2vec, BERT, and RoBERTa.
Long short-term memory (LSTM) neural networks were trained on embeddings from encoders created using approximately 48k LLVM functions from the Juliet dataset.
arXiv Detail & Related papers (2024-05-31T03:57:19Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.
We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching [8.655595404611821]
We introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features.
Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task.
arXiv Detail & Related papers (2024-01-20T07:57:57Z)
- VMCDL: Vulnerability Mining Based on Cascaded Deep Learning Under Source Control Flow [2.561778620560749]
This paper mainly uses the C/C++ source code data of the SARD dataset, processing source code of the CWE476, CWE469, CWE516 and CWE570 vulnerability types.
We propose a new cascading deep learning model VMCDL based on source code control flow to effectively detect vulnerabilities.
arXiv Detail & Related papers (2023-03-13T13:58:39Z)
- VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection [1.256413718364189]
VulBERTa is a deep learning approach to detect security vulnerabilities in source code.
Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects.
We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets.
arXiv Detail & Related papers (2022-05-25T00:56:43Z)
- Learning Stable Classifiers by Transferring Unstable Features [59.06169363181417]
We study transfer learning in the presence of spurious correlations.
We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task.
We hypothesize that the unstable features in the source task and those in the target task are directly related.
arXiv Detail & Related papers (2021-06-15T02:41:12Z)
- Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers [8.716427214870459]
We study the extent to which the output of off-the-shelf static code analyzers can be used as a source of features to represent commits in Machine Learning (ML) applications.
We investigate how such features can be used to construct embeddings and train ML models to automatically identify source code commits that contain vulnerability fixes.
We find that the combination of our method with commit2vec represents a tangible improvement over the state of the art in the automatic identification of commits that fix vulnerabilities.
arXiv Detail & Related papers (2021-05-07T15:57:17Z)
- Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
- Second-Order Provable Defenses against Adversarial Attacks [63.34032156196848]
We show that if the eigenvalues of the Hessian of the network are bounded, we can compute a certificate in the $l_2$ norm efficiently using convex optimization.
Our method achieves higher certified accuracy than IBP-based methods on 2-, 3- and 4-layer networks.
arXiv Detail & Related papers (2020-06-01T05:55:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.