Software Vulnerability Prediction Knowledge Transferring Between
Programming Languages
- URL: http://arxiv.org/abs/2303.06177v1
- Date: Fri, 10 Mar 2023 19:21:52 GMT
- Title: Software Vulnerability Prediction Knowledge Transferring Between
Programming Languages
- Authors: Khadija Hanifi, Ramin F Fouladi, Basak Gencer Unsalver, Goksu Karadag
- Abstract summary: We propose a transfer learning technique to leverage available datasets and generate a model to detect common vulnerabilities in different programming languages.
We use C source code samples to train a Convolutional Neural Network (CNN) model; then we use Java source code samples to adapt and evaluate the learned model.
The results show that the proposed model detects vulnerabilities in both C and Java code with an average recall of 72%.
- Score: 2.3035725779568583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing automated and smart software vulnerability detection models has
been receiving great attention from both research and development communities.
One of the biggest challenges in this area is the lack of code samples for all
different programming languages. In this study, we address this issue by
proposing a transfer learning technique to leverage available datasets and
generate a model to detect common vulnerabilities in different programming
languages. We use C source code samples to train a Convolutional Neural Network
(CNN) model; then we use Java source code samples to adapt and evaluate the
learned model. We use code samples from two benchmark datasets: the NIST Software
Assurance Reference Dataset (SARD) and the Draper VDISC dataset. The results show
that the proposed model detects vulnerabilities in both C and Java code with an
average recall of 72%. Additionally, we employ explainable AI to investigate
how much each feature contributes to the knowledge transfer mechanisms between
C and Java in the proposed model.
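The transfer step the abstract describes (pretrain a network on the large C dataset, then freeze the learned feature extractor and retrain only the classifier head on the smaller Java dataset) can be sketched with a toy two-layer network in plain NumPy. Everything here is an illustrative stand-in: the synthetic data, the tanh feature layer, and the training loop are assumptions for the sketch, not the paper's CNN, tokenization, or SARD/VDISC pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def make_data(n):
    # Toy stand-in for vectorized code samples. The label rule is the
    # same for both "languages", mimicking vulnerability patterns that
    # are common to C and Java.
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    return X, y

def train(X, y, W1, w2, lr=0.5, epochs=300, freeze_features=False):
    """Gradient descent on binary cross-entropy for a 2-layer net.

    With freeze_features=True only the output head w2 is updated,
    which is the transfer-learning step.
    """
    for _ in range(epochs):
        H = np.tanh(X @ W1)          # feature layer (CNN stand-in)
        p = sigmoid(H @ w2)          # classification head
        d2 = (p - y) / len(y)        # BCE gradient at the logits
        if not freeze_features:
            d1 = np.outer(d2, w2) * (1.0 - H ** 2)
            W1 = W1 - lr * X.T @ d1
        w2 = w2 - lr * H.T @ d2
    return W1, w2

def accuracy(X, y, W1, w2):
    return float(np.mean((sigmoid(np.tanh(X @ W1) @ w2) > 0.5) == y))

# 1) Pretrain both layers on the large "C" dataset.
Xc, yc = make_data(400)
W1 = rng.normal(scale=0.5, size=(2, 8))
W1, _ = train(Xc, yc, W1, np.zeros(8))

# 2) Transfer: keep the learned features frozen and retrain only the
#    head on a much smaller "Java" dataset.
Xj, yj = make_data(40)
W1, head = train(Xj, yj, W1, np.zeros(8), freeze_features=True)

Xtest, ytest = make_data(500)
print(f"transferred-model accuracy: {accuracy(Xtest, ytest, W1, head):.2f}")
```

Because the two toy tasks share the same decision rule, the features learned on the "C" data remain useful for the "Java" data, so a head trained on only 40 target samples still classifies a fresh test set well; this is the intuition behind reusing a C-trained model for Java.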
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Automated Repair of AI Code with Large Language Models and Formal Verification [4.9975496263385875]
The next generation of AI systems requires strong safety guarantees.
This report looks at the software implementation of neural networks and related memory safety properties.
We detect these vulnerabilities and automatically repair them with the help of large language models.
arXiv Detail & Related papers (2024-05-14T11:52:56Z) - VULNERLIZER: Cross-analysis Between Vulnerabilities and Software
Libraries [4.2755847332268235]
VULNERLIZER is a novel framework for cross-analysis between vulnerabilities and software libraries.
It uses CVE and software library data together with clustering algorithms to generate links between vulnerabilities and libraries.
The trained model reaches a prediction accuracy of 75% or higher.
arXiv Detail & Related papers (2023-09-18T10:34:47Z) - Language Models for Novelty Detection in System Call Traces [0.27309692684728604]
This paper introduces a novelty detection methodology that relies on a probability distribution over sequences of system calls.
The proposed methodology requires minimal expert hand-crafting and achieves an F-score and AUROC greater than 95% on most novelties.
The source code and trained models are publicly available on GitHub while the datasets are available on Zenodo.
arXiv Detail & Related papers (2023-09-05T13:11:40Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Enhancing Multiple Reliability Measures via Nuisance-extended
Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can come from biases in data acquisition rather than from the underlying task.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Evaluating few shot and Contrastive learning Methods for Code Clone
Detection [5.1623866691702744]
Code Clone Detection is a software engineering task that is used for plagiarism detection, code search, and code comprehension.
Deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of ~95% on the CodeXGLUE benchmark.
No previous study evaluates the generalizability of these models where a limited amount of annotated data is available.
arXiv Detail & Related papers (2022-04-15T15:01:55Z) - InfoBERT: Improving Robustness of Language Models from An Information
Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.