CodeAttack: Code-Based Adversarial Attacks for Pre-trained Programming Language Models
- URL: http://arxiv.org/abs/2206.00052v3
- Date: Tue, 18 Apr 2023 22:12:55 GMT
- Title: CodeAttack: Code-Based Adversarial Attacks for Pre-trained Programming Language Models
- Authors: Akshita Jha and Chandan K. Reddy
- Abstract summary: We propose CodeAttack, a simple yet effective black-box attack model that uses code structure to generate effective, efficient, and imperceptible adversarial code samples.
We evaluate the transferability of CodeAttack on several code-code (translation and repair) and code-NL (summarization) tasks across different programming languages.
- Score: 8.832864937330722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained programming language (PL) models (such as CodeT5, CodeBERT,
GraphCodeBERT, etc.) have the potential to automate software engineering tasks
involving code understanding and code generation. However, these models operate
in the natural channel of code, i.e., they are primarily concerned with the
human understanding of the code. They are not robust to changes in the input
and thus are potentially susceptible to adversarial attacks in the natural
channel. We propose CodeAttack, a simple yet effective black-box attack model
that uses code structure to generate effective, efficient, and imperceptible
adversarial code samples and demonstrates the vulnerabilities of the
state-of-the-art PL models to code-specific adversarial attacks. We evaluate
the transferability of CodeAttack on several code-code (translation and repair)
and code-NL (summarization) tasks across different programming languages.
CodeAttack outperforms state-of-the-art adversarial NLP attack models, achieving
the largest overall drop in performance while being more efficient,
imperceptible, consistent, and fluent. The code can be found at
https://github.com/reddy-lab-code-research/CodeAttack.
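To make the recipe described in the abstract concrete, below is a minimal sketch of a greedy black-box substitution attack of the kind CodeAttack describes: rank token positions by their influence on the victim model's output, then greedily replace the most vulnerable ones under a small edit budget. Everything here is a simplified stand-in rather than the authors' implementation; `victim_score` and `candidate_substitutes` are hypothetical hooks for the victim PL model and for a structure-aware substitution generator.

```python
# Minimal sketch of a greedy black-box substitution attack on a code model.
# NOT the official CodeAttack implementation: victim_score() and
# candidate_substitutes() are hypothetical hooks standing in for the victim
# PL model's output quality and a structure-aware substitution generator.
from typing import Callable, List

def rank_vulnerable_tokens(tokens: List[str],
                           victim_score: Callable[[str], float]) -> List[int]:
    """Order token positions by how much masking them changes the victim's score."""
    base = victim_score(" ".join(tokens))
    influence = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["<unk>"] + tokens[i + 1:]
        influence.append((abs(base - victim_score(" ".join(masked))), i))
    return [i for _, i in sorted(influence, reverse=True)]

def greedy_attack(tokens: List[str],
                  victim_score: Callable[[str], float],
                  candidate_substitutes: Callable[[List[str], int], List[str]],
                  max_edits: int = 3) -> List[str]:
    """Replace the most influential tokens, keeping whichever edit hurts the victim most."""
    adv = list(tokens)
    best = victim_score(" ".join(adv))   # lower score = worse victim output
    edits = 0
    for pos in rank_vulnerable_tokens(adv, victim_score):
        if edits >= max_edits:
            break
        for cand in candidate_substitutes(adv, pos):   # e.g. similar identifiers only
            trial = adv[:pos] + [cand] + adv[pos + 1:]
            score = victim_score(" ".join(trial))
            if score < best:
                adv, best = trial, score
                edits += 1
                break
    return adv
```

In CodeAttack, imperceptibility comes from constraining substitutions using code structure; in this sketch that constraint would live inside the `candidate_substitutes` hook.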
Related papers
- Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate based on the strong baselines.
The robustness of the logical comment decoding strategy is notably higher than that of Chain-of-Thought prompting.
arXiv Detail & Related papers (2024-04-11T08:30:46Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- Adversarial Attacks on Code Models with Discriminative Graph Patterns [10.543744143786519]
We propose a novel adversarial attack framework, GraphCodeAttack, to better evaluate the robustness of code models.
Given a target code model, GraphCodeAttack automatically mines important code patterns, which can influence the model's decisions.
To effectively synthesize attacks from AST patterns, GraphCodeAttack uses a separate pre-trained code model to fill in the ASTs with concrete code snippets.
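As a rough, purely illustrative take on the "mine important code patterns" step, the sketch below counts frequent short AST node-type paths over a toy corpus using Python's ast module. This frequency-based miner is a simplifying assumption, not the discriminative mining in the paper, and the second step (filling the mined AST patterns with concrete code via a separate pre-trained model) is not shown.

```python
# Rough stand-in for the pattern-mining step: count frequent short AST
# node-type paths in a corpus with Python's ast module. The paper mines
# patterns discriminatively w.r.t. the target model; this is frequency-only.
import ast
from collections import Counter

def node_type_paths(tree: ast.AST, depth: int = 3):
    """Yield root-to-descendant paths of AST node types of a fixed depth."""
    def walk(node, path):
        path = path + [type(node).__name__]
        if len(path) == depth:
            yield tuple(path)
            return
        for child in ast.iter_child_nodes(node):
            yield from walk(child, path)
    yield from walk(tree, [])

def mine_patterns(snippets, top_k: int = 10):
    counts = Counter()
    for src in snippets:
        try:
            counts.update(node_type_paths(ast.parse(src)))
        except SyntaxError:
            continue
    return counts.most_common(top_k)

corpus = ["def add(a, b):\n    return a + b",
          "def mul(a, b):\n    return a * b"]
print(mine_patterns(corpus))
```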
arXiv Detail & Related papers (2023-08-22T03:40:34Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
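The following sketch shows one way a mutation-based augmentation step for an execution task could look, under simplifying assumptions: mutate a small Python snippet by swapping binary operators, run both versions, and keep the resulting variable values as supervision. It illustrates the idea only; the paper's dataset construction and the CodeExecutor model itself are not reproduced here.

```python
# Hypothetical simplification of mutation-based augmentation for an execution
# task: swap binary operators in a small Python snippet, run both versions,
# and keep the resulting variable values as execution supervision.
import ast

OP_SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Add}

class OperatorMutator(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        swap = OP_SWAPS.get(type(node.op))
        if swap is not None:
            node.op = swap()
        return node

def mutate(source: str) -> str:
    tree = OperatorMutator().visit(ast.parse(source))
    return ast.unparse(tree)              # requires Python 3.9+

def execution_record(source: str) -> dict:
    """Run a snippet in an isolated namespace and keep its final variable values."""
    namespace: dict = {}
    exec(compile(source, "<augmented>", "exec"), {}, namespace)
    return namespace

seed = "x = 2\ny = 3\nz = x * y + 1"
for program in (seed, mutate(seed)):
    print(program, "->", execution_record(program))
```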
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- Contrastive Learning for Source Code with Structural and Functional Properties [66.10710134948478]
We present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code that looks drastically different from the original one.
We train our model in a way that brings functionally equivalent code closer together and pushes distinct code further apart through a contrastive learning objective.
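A generic version of such a contrastive objective is sketched below in PyTorch: embeddings of a program and its functionally equivalent transform are treated as a positive pair, and all other programs in the batch as negatives. This is an assumed InfoNCE/NT-Xent formulation, not necessarily the exact loss used by BOOST.

```python
# Generic InfoNCE-style contrastive loss over code embeddings (assumed
# formulation, not necessarily the exact BOOST objective): each program and
# its functionally equivalent transform form a positive pair; all other
# programs in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull each program toward its equivalent variant, push it away from the rest."""
    a = F.normalize(anchor, dim=-1)           # (batch, dim)
    p = F.normalize(positive, dim=-1)         # (batch, dim)
    logits = a @ p.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # diagonal entries are the positives

# toy usage with random stand-ins for encoder outputs
emb_original = torch.randn(8, 128)
emb_transformed = emb_original + 0.01 * torch.randn(8, 128)
print(contrastive_loss(emb_original, emb_transformed).item())
```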
arXiv Detail & Related papers (2021-10-08T02:56:43Z)
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
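As a small, hypothetical illustration of what identifier-aware pre-training data might look like, the sketch below collects developer-assigned identifiers with Python's ast module and replaces them with sentinel tokens. This is a data-preparation toy only, not CodeT5's actual objectives or training code.

```python
# Hypothetical identifier-aware data preparation: collect developer-assigned
# identifiers via Python's ast module and replace them with sentinel tokens.
# Data-side toy only; not CodeT5's objectives or training code.
import ast
import re

def collect_identifiers(source: str) -> set:
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return names

def mask_identifiers(source: str) -> str:
    mapping = {name: f"<MASK_{i}>"
               for i, name in enumerate(sorted(collect_identifiers(source)))}
    if not mapping:
        return source
    pattern = r"\b(" + "|".join(re.escape(n) for n in mapping) + r")\b"
    return re.sub(pattern, lambda m: mapping[m.group(1)], source)

print(mask_identifiers("def area(width, height):\n    return width * height"))
```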
arXiv Detail & Related papers (2021-09-02T12:21:06Z)
- CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model [23.947178895479464]
We propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST).
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
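A minimal sketch of how targets for these two objectives could be derived from source code is given below, assuming Python input and using only the standard library: parent-child edges between AST nodes for edge prediction, and a coarse type per token for token-type prediction. It is illustrative data preparation, not the CLSEBERT model.

```python
# Hypothetical data preparation for the two objectives above, for Python input:
# (1) parent-child edges between AST nodes, (2) a coarse type for each token.
import ast
import io
import keyword
import tokenize

def ast_edges(source: str):
    """(parent node type, child node type) pairs used as edge-prediction targets."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

def token_types(source: str):
    """(token, coarse type) pairs used as token-type prediction targets."""
    labeled = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            kind = "keyword" if keyword.iskeyword(tok.string) else "identifier"
            labeled.append((tok.string, kind))
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            labeled.append((tok.string, "literal"))
        elif tok.type == tokenize.OP:
            labeled.append((tok.string, "operator"))
    return labeled

src = "def square(x):\n    return x * x\n"
print(ast_edges(src))
print(token_types(src))
```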
arXiv Detail & Related papers (2021-08-10T10:08:21Z)
- InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees [17.461451218469062]
This paper proposes InferCode to overcome this limitation by adapting the self-supervised learning mechanism to build source code models.
Subtrees in ASTs are treated by InferCode as labels for training code representations, without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, the pre-trained InferCode model achieves higher performance.
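To illustrate the "subtrees as labels" idea, the sketch below serializes every AST subtree of a Python snippet into a node-type signature and treats the resulting set as free training targets. This simplification (no encoder, no training loop) is an assumption for illustration; InferCode's actual subtree vocabulary and model are not reproduced.

```python
# Toy version of "subtrees as labels": serialize every AST subtree of a
# snippet into a node-type signature and treat the set as training targets.
# No encoder or training loop is shown.
import ast

def subtree_signature(node: ast.AST) -> str:
    children = ",".join(subtree_signature(c) for c in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})" if children else type(node).__name__

def subtree_labels(source: str, min_nodes: int = 2):
    """Collect serialized subtrees to use as prediction targets for the snippet."""
    labels = set()
    for node in ast.walk(ast.parse(source)):
        if sum(1 for _ in ast.walk(node)) >= min_nodes:
            labels.add(subtree_signature(node))
    return labels

print(subtree_labels("def inc(n):\n    return n + 1"))
```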
arXiv Detail & Related papers (2020-12-13T10:33:41Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
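A heavily simplified version of the "where-the-value-comes-from" relation can be extracted as sketched below for straight-line Python code: each variable use is linked to the most recent assignment of that name. The real data-flow graphs handle control flow and multiple languages; this toy version, which ignores branching, loops, and scopes, is only meant to show what the edges encode.

```python
# Deliberately simplified "where-the-value-comes-from" extraction for
# straight-line Python code: link each variable use to the latest assignment
# of that name. Ignores branching, loops, and scopes.
import ast

def value_flow_edges(source: str):
    last_def = {}   # variable name -> line of its latest assignment
    edges = []      # (name, definition line, use line)
    for stmt in ast.parse(source).body:
        # uses on the right-hand side refer to earlier definitions
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((node.id, last_def[node.id], node.lineno))
        # then record the definitions this statement introduces
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = node.lineno
    return edges

src = "a = 1\nb = a + 2\nc = a + b\n"
print(value_flow_edges(src))   # [('a', 1, 2), ('a', 1, 3), ('b', 2, 3)]
```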
arXiv Detail & Related papers (2020-09-17T15:25:56Z)