Do Language Models Learn Semantics of Code? A Case Study in
Vulnerability Detection
- URL: http://arxiv.org/abs/2311.04109v1
- Date: Tue, 7 Nov 2023 16:31:56 GMT
- Title: Do Language Models Learn Semantics of Code? A Case Study in
Vulnerability Detection
- Authors: Benjamin Steenhoek, Md Mahbubur Rahman, Shaila Sharmin, and Wei Le
- Abstract summary: We analyze the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis.
We develop two annotation methods which highlight the bug semantics inside the model's inputs.
Our findings indicate that it is helpful to provide the model with information about the bug semantics and that the model can attend to it; they also motivate future work on learning more complex path-based bug semantics.
- Score: 7.725755567907359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, pretrained language models have shown state-of-the-art performance
on the vulnerability detection task. These models are pretrained on a large
corpus of source code, then fine-tuned on a smaller supervised vulnerability
dataset. Given the models' different training objectives and their strong
performance, it is natural to ask whether they have learned the semantics of
code relevant to vulnerability detection, namely bug semantics, and if so, how
alignment with bug semantics relates to model performance. In
this paper, we analyze the models using three distinct methods:
interpretability tools, attention analysis, and interaction matrix analysis. We
compare the models' influential feature sets with the bug semantic features
which define the causes of bugs, including buggy paths and Potentially
Vulnerable Statements (PVS). We found that (1) better-performing models also
aligned better with PVS, (2) the models failed to align strongly with PVS, and
(3) the models failed to align at all with buggy paths. Based on our analysis, we
developed two annotation methods which highlight the bug semantics inside the
model's inputs. We evaluated our approach on four distinct transformer models
and four vulnerability datasets and found that our annotations improved the
models' performance in the majority of settings - 11 out of 16, with up to 9.57
points improvement in F1 score compared to conventional fine-tuning. We further
found that with our annotations, the models aligned up to 232% better with
potentially vulnerable statements. Our findings indicate that it is helpful to
provide the model with information about the bug semantics and that the model
can attend to it; they also motivate future work on learning more complex
path-based bug semantics. Our code and data are available at
https://figshare.com/s/4a16a528d6874aad51a0.
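The authors' code and data are linked above; as a rough illustration of what an attention-alignment check against Potentially Vulnerable Statements (PVS) could look like, here is a minimal sketch. The checkpoint (microsoft/codebert-base), the toy C function, the hand-picked PVS line, and the layer/head-averaged "attention received" score are all assumptions made for illustration, not the paper's exact models, data, or metric.

```python
# Hedged sketch: estimate how much of an encoder's attention falls on tokens
# belonging to an annotated Potentially Vulnerable Statement (PVS).
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "microsoft/codebert-base"  # assumed checkpoint; any encoder exposing attentions works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

# Toy input: a C function with an off-by-one copy; the PVS below is a hand-made annotation.
code = (
    "int copy(char *dst, const char *src, int n) {\n"
    "    for (int i = 0; i <= n; i++)\n"
    "        dst[i] = src[i];\n"
    "    return 0;\n"
    "}\n"
)
pvs_substrings = ["dst[i] = src[i];"]  # assumed PVS annotation for this toy example

enc = tokenizer(code, return_tensors="pt", return_offsets_mapping=True, truncation=True)
offsets = enc.pop("offset_mapping")[0]  # (seq_len, 2) character span per token

with torch.no_grad():
    out = model(**enc)

# Average attention over layers and heads, then score each token by the
# total attention it receives from all query positions.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # -> (seq_len, seq_len)
received = attn.sum(dim=0)

# Mark tokens whose character span overlaps any PVS substring
# (special tokens have empty (0, 0) spans and are excluded).
pvs_spans = [(code.index(s), code.index(s) + len(s)) for s in pvs_substrings]
is_pvs = torch.tensor([
    end > start and any(start < e and s < end for (s, e) in pvs_spans)
    for (start, end) in offsets.tolist()
])

pvs_share = received[is_pvs].sum() / received.sum()
print(f"Attention mass on PVS tokens: {pvs_share.item():.2%}")
```

The same token-to-character offsets could also support a simple annotation-style experiment in the spirit of the paper, e.g. wrapping the PVS line in marker text before fine-tuning, though the paper's actual annotation methods may differ.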
Related papers
- Towards Causal Deep Learning for Vulnerability Detection [31.59558109518435]
We introduce do-calculus-based causal learning to software engineering models.
Our results show that CausalVul consistently improved the model accuracy, robustness and OOD performance.
arXiv Detail & Related papers (2023-10-12T00:51:06Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- An Empirical Study of Deep Learning Models for Vulnerability Detection [4.243592852049963]
We surveyed and reproduced 9 state-of-the-art deep learning models on 2 widely used vulnerability detection datasets.
We investigated model capabilities, training data, and model interpretation.
Our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models.
arXiv Detail & Related papers (2022-12-15T19:49:34Z)
- Part-Based Models Improve Adversarial Robustness [57.699029966800644]
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks.
Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and classify them.
Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations.
arXiv Detail & Related papers (2022-09-15T15:41:47Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Towards Trustworthy Deception Detection: Benchmarking Model Robustness across Domains, Modalities, and Languages [10.131671217810581]
We evaluate model robustness to out-of-domain data, modality-specific features, and languages other than English.
We find that with additional image content as input, ELMo embeddings yield significantly fewer errors compared to BERT or GloVe.
arXiv Detail & Related papers (2021-04-23T18:05:52Z)
- What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)
- Debugging Tests for Model Explanations [18.073554618753395]
Methods tested are able to diagnose a spurious background bug, but not conclusively identify mislabeled training examples.
We complement our analysis with a human subject study, and find that subjects fail to identify defective models using attributions, but instead rely, primarily, on model predictions.
arXiv Detail & Related papers (2020-11-10T22:23:25Z)