Pre-training by Predicting Program Dependencies for Vulnerability
Analysis Tasks
- URL: http://arxiv.org/abs/2402.00657v1
- Date: Thu, 1 Feb 2024 15:18:19 GMT
- Title: Pre-training by Predicting Program Dependencies for Vulnerability
Analysis Tasks
- Authors: Zhongxin Liu, Zhijie Tang, Junwei Zhang, Xin Xia, and Xiaohu Yang
- Abstract summary: This work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP).
CDP and DDP aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet based only on its source code.
After pre-training, the resulting model can boost the understanding of vulnerable code during fine-tuning and can be used directly to perform dependence analysis for both partial and complete functions.
- Score: 12.016029378106131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vulnerability analysis is crucial for software security. This work focuses on
using pre-training techniques to enhance the understanding of vulnerable code
and boost vulnerability analysis. The code understanding ability of a
pre-trained model is highly related to its pre-training objectives. The
semantic structure, e.g., control and data dependencies, of code is important
for vulnerability analysis. However, existing pre-training objectives either
ignore such structure or focus on learning to use it. The feasibility and
benefits of learning the knowledge of analyzing semantic structure have not
been investigated. To this end, this work proposes two novel pre-training
objectives, namely Control Dependency Prediction (CDP) and Data Dependency
Prediction (DDP), which aim to predict the statement-level control dependencies
and token-level data dependencies, respectively, in a code snippet based only
on its source code. During pre-training, CDP and DDP can guide the model to
learn the knowledge required for analyzing fine-grained dependencies in code.
After pre-training, the pre-trained model can boost the understanding of
vulnerable code during fine-tuning and can directly be used to perform
dependence analysis for both partial and complete functions. To demonstrate the
benefits of our pre-training objectives, we pre-train a Transformer model named
PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks,
i.e., vulnerability detection, vulnerability classification, and vulnerability
assessment, and also evaluate it on program dependence analysis. Experimental
results show that PDBERT benefits from CDP and DDP, leading to state-of-the-art
performance on the three downstream tasks. Also, PDBERT achieves F1-scores of
over 99% and 94% for predicting control and data dependencies, respectively, in
partial and complete functions.
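The abstract does not spell out how dependency prediction is posed as a learning objective. One plausible framing, purely as an illustration and not PDBERT's actual architecture (the bilinear scorer, names, and toy shapes below are all assumptions), is to treat token-level data dependency prediction as binary classification over every ordered pair of token representations produced by the encoder:

```python
import numpy as np

def pairwise_dependency_logits(token_states, W):
    """Score every ordered token pair (i, j) for a data dependency.

    token_states: (n, d) array of encoder outputs for n code tokens.
    W: (d, d) bilinear weight matrix learned during pre-training.
    Returns an (n, n) matrix of logits; entry [i, j] scores whether
    token j data-depends on token i.
    """
    return token_states @ W @ token_states.T

rng = np.random.default_rng(0)
n, d = 6, 8                          # 6 tokens, hidden size 8 (toy sizes)
h = rng.normal(size=(n, d))          # stand-in for Transformer outputs
W = rng.normal(size=(d, d))
logits = pairwise_dependency_logits(h, W)
probs = 1 / (1 + np.exp(-logits))    # independent sigmoid per token pair
print(probs.shape)                   # (6, 6)
```

During pre-training, each pair's sigmoid output would be supervised against ground-truth dependency edges extracted by a static analyzer, so the model learns to recover dependencies from source text alone; statement-level control dependency prediction could be framed the same way over statement representations.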
Related papers
- DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection [7.802093464108404]
We propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks.
Specifically, we parse data flow graphs from function-level source code, and use the data type of the variable as the node characteristics of the DFG.
Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
arXiv Detail & Related papers (2024-10-24T07:05:07Z)
- In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models [37.45103473809928]
We propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model.
By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data.
arXiv Detail & Related papers (2024-08-07T05:48:05Z)
- Understanding Programmatic Weak Supervision via Source-aware Influence Function [76.74549130841383]
Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels.
We build on the Influence Function (IF) to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple.
These primitive influence scores can then be used to estimate the influence of individual PWS components, such as source votes, supervision sources, and training data.
arXiv Detail & Related papers (2022-05-25T15:57:24Z)
- Unified Instance and Knowledge Alignment Pretraining for Aspect-based Sentiment Analysis [96.53859361560505]
Aspect-based Sentiment Analysis (ABSA) aims to determine the sentiment polarity towards an aspect.
There always exists a severe domain shift between the pretraining and downstream ABSA datasets.
We introduce a unified alignment pretraining framework into the vanilla pretrain-finetune pipeline.
arXiv Detail & Related papers (2021-10-26T04:03:45Z)
- Identifying Non-Control Security-Critical Data through Program Dependence Learning [9.764831771725952]
In data-oriented attacks, a fundamental step is to identify non-control, security-critical data.
We propose a novel approach that combines traditional program analysis with deep learning.
The toolchain uncovers 80 potential critical variables in Google FuzzBench.
arXiv Detail & Related papers (2021-08-27T00:28:06Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Relate and Predict: Structure-Aware Prediction with Jointly Optimized Neural DAG [13.636680313054631]
We propose a deep neural network framework, dGAP, to learn a neural dependency Graph and optimize structure-Aware target Prediction.
dGAP trains towards a structure self-supervision loss and a target prediction loss jointly.
We empirically evaluate dGAP on multiple simulated and real datasets.
arXiv Detail & Related papers (2021-03-03T13:55:12Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Estimating Structural Target Functions using Machine Learning and Influence Functions [103.47897241856603]
We propose a new framework for statistical machine learning of target functions arising as identifiable functionals from statistical models.
This framework is problem- and model-agnostic and can be used to estimate a broad variety of target parameters of interest in applied statistics.
We put particular focus on so-called coarsening at random/doubly robust problems with partially unobserved information.
arXiv Detail & Related papers (2020-08-14T16:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.