Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?
- URL: http://arxiv.org/abs/2408.04863v1
- Date: Fri, 9 Aug 2024 04:56:26 GMT
- Title: Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?
- Authors: Yu Zhao, Lina Gong, Zhiqiu Huang, Yongwei Wang, Mingqiang Wei, Fei Wu
- Abstract summary: We investigate the effects of code embeddings generated by ten different code PTMs on the performance of vulnerability detection.
We propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks.
- Score: 30.84647604639891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, reusing various code pre-trained models (PTMs) for code embedding in vulnerability detection has become common, often without reasonable justification. The premise behind casually reusing PTMs is that the code embeddings generated by different PTMs would have a similar impact on performance. Is that true? To answer this question, we systematically investigate the effects of code embeddings generated by ten different code PTMs on vulnerability detection performance, and find that the answer is no. We observe that code embeddings generated by different code PTMs can indeed influence performance, and that selecting an embedding technique based on parameter scale or embedding dimension is not reliable. Our findings highlight the necessity of quantifying and evaluating the characteristics of the code embeddings generated by various code PTMs. To this end, we analyze the numerical representation and data distribution of code embeddings generated by different PTMs to evaluate their differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework that assists engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) to construct a specialized code PTM recommendation dataset. We then employ a Random Forest classifier to train a recommendation model that identifies the optimal code PTM from a candidate model zoo.
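The recommendation pipeline described above lends itself to a compact sketch. Only the three metric dimensions and the Random Forest recommender come from the abstract; the concrete metrics and training data below are illustrative stand-ins, not the paper's exact thirteen metrics or dataset.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

def embedding_metrics(emb: np.ndarray) -> np.ndarray:
    """Characterize one code embedding along three dimensions:
    statistics, norm, and distribution (illustrative metric choices)."""
    counts, _ = np.histogram(emb, bins=32)
    return np.array([
        emb.mean(), emb.var(), emb.min(), emb.max(),     # statistics
        stats.skew(emb), stats.kurtosis(emb),
        np.linalg.norm(emb, 1), np.linalg.norm(emb, 2),  # norm
        np.abs(emb).max(),
        stats.entropy(counts + 1e-9),                    # distribution
    ])

# Hypothetical recommendation dataset: metric vectors for a task's
# embeddings, labeled with the index of the PTM that performed best on it.
rng = np.random.default_rng(0)
X = np.stack([embedding_metrics(rng.standard_normal(768)) for _ in range(200)])
y = rng.integers(0, 10, size=200)  # optimal PTM among ten candidates

recommender = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(recommender.predict(X[:1]))  # recommended PTM index for a new task
```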
Related papers
- DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs [56.24431208419858]
We introduce Direct Preference Learning with Only Self-Generated Tests and Code (DSTC).
DSTC uses only self-generated code snippets and tests to construct reliable preference pairs.
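A minimal sketch of this construction, assuming each self-generated candidate defines a `solve` function and each self-generated test is an assertion string (both conventions are hypothetical, not from the paper):

```python
def build_preference_pairs(candidates, tests):
    """Split candidates by whether they pass all self-generated tests,
    then pair passing (chosen) with failing (rejected) snippets."""
    passed, failed = [], []
    for code in candidates:
        ns = {}
        try:
            exec(code, ns)            # run the self-generated snippet
            for t in tests:
                exec(t, ns)           # e.g. "assert solve(3) == 6"
            passed.append(code)
        except Exception:
            failed.append(code)
    return [(c, r) for c in passed for r in failed]

pairs = build_preference_pairs(
    ["def solve(x):\n    return x * 2", "def solve(x):\n    return x + 2"],
    ["assert solve(3) == 6"],
)
```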
arXiv Detail & Related papers (2024-11-20T02:03:16Z) - CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
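The simultaneous generate-and-evaluate idea can be pictured as a pass matrix between candidate programs and candidate tests; the toy below scores both sides by agreement (the paper's actual ranking and dataset construction are more involved):

```python
import numpy as np

def run(code: str, test: str) -> bool:
    """Hypothetical executor: does one generated test pass on one candidate?"""
    ns = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return True
    except Exception:
        return False

def mutual_scores(candidates, tests):
    # Rows = candidate programs, columns = generated tests.
    passes = np.array([[run(c, t) for t in tests] for c in candidates], float)
    code_scores = passes.mean(axis=1)  # code validated by more tests ranks higher
    test_scores = passes.mean(axis=0)  # tests agreeing with more code rank higher
    return code_scores, test_scores
```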
arXiv Detail & Related papers (2024-10-08T01:36:15Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
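CodeIP itself is grammar-guided and steers an LLM's decoding; the self-contained toy below only illustrates the generic multi-bit principle of encoding provenance bits by partitioning a vocabulary with a hash keyed on the previous token (vocabulary, names, and scheme are all hypothetical):

```python
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(64)]  # hypothetical token vocabulary

def partition(prev_token: str):
    """Deterministically split the vocabulary in two, keyed by the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    shuffled = VOCAB[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (encodes bit 0, encodes bit 1)

def embed_bits(bits, length):
    out = ["<bos>"]
    for i in range(length):
        zero, one = partition(out[-1])
        out.append(random.choice(one if bits[i % len(bits)] else zero))
    return out[1:]

def extract_bits(tokens, n_bits):
    votes, prev = [[0, 0] for _ in range(n_bits)], "<bos>"
    for i, tok in enumerate(tokens):
        _, one = partition(prev)
        votes[i % n_bits][1 if tok in one else 0] += 1
        prev = tok
    return [int(v[1] > v[0]) for v in votes]

assert extract_bits(embed_bits([1, 0, 1, 1], 40), 4) == [1, 0, 1, 1]
```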
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z) - How to get better embeddings with code pre-trained models? An empirical study [6.220333404184779]
We study five different code pre-trained models (PTMs) to generate embeddings for downstream classification tasks.
We find that embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet.
Code embeddings obtained by combining code data and text data in the same way as during PTM pre-training are of poor quality and cannot guarantee richer semantic information.
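This finding suggests aggregating over all tokens (e.g., mean pooling) rather than relying on the special token alone. A minimal comparison using HuggingFace `transformers` with `microsoft/codebert-base` as a stand-in PTM:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def add(a, b):\n    return a + b"
inputs = tok(code, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

cls_embedding = hidden[:, 0, :]  # special-token embedding (<s>/[CLS])
mask = inputs["attention_mask"].unsqueeze(-1)
mean_embedding = (hidden * mask).sum(1) / mask.sum(1)  # aggregates every token
```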
arXiv Detail & Related papers (2023-11-14T10:44:21Z) - Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for detecting LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z) - Adversarial Attacks on Code Models with Discriminative Graph Patterns [10.543744143786519]
We propose a novel adversarial attack framework, GraphCodeAttack, to better evaluate the robustness of code models.
Given a target code model, GraphCodeAttack automatically mines important code patterns, which can influence the model's decisions.
To effectively synthesize attacks from AST patterns, GraphCodeAttack uses a separate pre-trained code model to fill in the ASTs with concrete code snippets.
arXiv Detail & Related papers (2023-08-22T03:40:34Z) - Towards Efficient Fine-tuning of Pre-trained Code Models: An
Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
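A minimal layer-freezing sketch with `microsoft/codebert-base` as a stand-in model; Telly chooses which layers to freeze based on its probing analysis, whereas this example simply freezes a fixed bottom k:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")  # stand-in PTM

# Freeze the embeddings and the bottom k encoder layers; only the top
# layers (and any task head) receive gradient updates during fine-tuning.
k = 8
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:k]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```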
arXiv Detail & Related papers (2023-04-11T13:34:13Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
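As one concrete example of such a transformation, the sketch below applies a semantics-preserving variable rename with Python's `ast` module (a simplified stand-in for the benchmark's 30+ perturbations):

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """One ReCode-style perturbation: rename local variables while
    preserving program behavior."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

src = "def area(width, height):\n    result = width * height\n    return result"
tree = RenameVariables({"result": "tmp_0"}).visit(ast.parse(src))
print(ast.unparse(tree))  # same behavior, perturbed surface form
```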
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - Multi-View Pre-Trained Model for Code Vulnerability Identification [10.129948567398506]
We propose a novel Multi-View Pre-Trained Model (MV-PTM) that encodes both sequential and multi-type structural information of the source code.
Experiments conducted on two public datasets demonstrate the superiority of MV-PTM.
arXiv Detail & Related papers (2022-08-10T09:00:58Z) - What do pre-trained code models know about code? [9.60966128833701]
We use diagnostic tasks called probes to investigate pre-trained code models.
BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow) are investigated.
arXiv Detail & Related papers (2021-08-25T16:20:17Z)
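A probe in this sense is typically a small classifier fit on frozen representations from one layer; high probe accuracy suggests the layer encodes the property being tested. A minimal sketch with placeholder data (`layer_states` and `labels` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Frozen hidden states from one layer of a code PTM, plus labels for a
# code property (e.g., token kind, AST depth bucket); both random here.
rng = np.random.default_rng(0)
layer_states = rng.standard_normal((500, 768))
labels = rng.integers(0, 2, size=500)

probe = LogisticRegression(max_iter=1000).fit(layer_states, labels)
print("probe accuracy:", probe.score(layer_states, labels))
```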