CLAWSAT: Towards Both Robust and Accurate Code Models
- URL: http://arxiv.org/abs/2211.11711v2
- Date: Tue, 22 Nov 2022 03:38:36 GMT
- Title: CLAWSAT: Towards Both Robust and Accurate Code Models
- Authors: Jinghan Jia and Shashank Srikant and Tamara Mitrovska and Chuang Gan
and Shiyu Chang and Sijia Liu and Una-May O'Reilly
- Abstract summary: We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models.
To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models.
- Score: 74.57590254102311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We integrate contrastive learning (CL) with adversarial learning to
co-optimize the robustness and accuracy of code models. Different from existing
works, we show that code obfuscation, a standard code transformation operation,
provides novel means to generate complementary `views' of a code that enable us
to achieve both robust and accurate code models. To the best of our knowledge,
this is the first systematic study to explore and exploit the robustness and
accuracy benefits of (multi-view) code obfuscations in code models.
Specifically, we first adopt adversarial codes as robustness-promoting views in
CL at the self-supervised pre-training phase. This yields improved robustness
and transferability for downstream tasks. Next, at the supervised fine-tuning
stage, we show that adversarial training with a proper temporally-staggered
schedule of adversarial code generation can further improve robustness and
accuracy of the pre-trained code model. Built on the above two modules, we
develop CLAWSAT, a novel self-supervised learning (SSL) framework for code by
integrating $\underline{\textrm{CL}}$ with $\underline{\textrm{a}}$dversarial
vie$\underline{\textrm{w}}$s (CLAW) and $\underline{\textrm{s}}$taggered
$\underline{\textrm{a}}$dversarial $\underline{\textrm{t}}$raining (SAT). On
evaluating three downstream tasks across Python and Java, we show that CLAWSAT
consistently yields the best robustness and accuracy ($\textit{e.g.}$ 11$\%$ in
robustness and 6$\%$ in accuracy on the code summarization task in Python). We
additionally demonstrate the effectiveness of adversarial learning in CLAW by
analyzing the characteristics of the loss landscape and interpretability of the
pre-trained models.
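Read operationally, the abstract describes two pieces: (i) a contrastive pre-training step in which one view of a program is a benign obfuscation and the other is an adversarially generated variant (CLAW), and (ii) a fine-tuning loop in which adversarial code generation is switched on only after an initial clean phase (SAT). The PyTorch-style sketch below is a generic illustration of that recipe, not the authors' released implementation; make_obfuscated_view, make_adversarial_view, and switch_epoch are hypothetical placeholders.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """Standard NT-Xent contrastive loss between two batches of paired views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def claw_pretrain_step(encoder, batch, make_obfuscated_view, make_adversarial_view):
    """CLAW-style step: contrast a benign obfuscated view of each program with
    an adversarially generated view of the same program."""
    v_benign = encoder(make_obfuscated_view(batch))           # accuracy-promoting view
    v_adv = encoder(make_adversarial_view(batch, encoder))    # robustness-promoting view
    return nt_xent(v_benign, v_adv)

def sat_finetune(model, loader, optimizer, epochs, switch_epoch, make_adversarial_view):
    """SAT-style fine-tuning: train on clean code first, then switch to
    adversarial code generation after switch_epoch (the staggered schedule)."""
    for epoch in range(epochs):
        use_adversarial = epoch >= switch_epoch
        for inputs, targets in loader:
            if use_adversarial:
                inputs = make_adversarial_view(inputs, model)
            loss = F.cross_entropy(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Here the temporally-staggered schedule from the abstract is reduced to a single switch_epoch threshold; the paper's actual schedule and adversarial code generator may differ.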
Related papers
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
The paper proposes Positional Integrity Encoding (PIE), which reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
- Zero-Shot Code Representation Learning via Prompt Tuning [6.40875582886359]
We propose Zecoler, a zero-shot approach for learning code representations.
Zecoler is built upon a pre-trained programming language model.
We evaluate Zecoler in five code intelligence tasks including code clone detection, code search, method name prediction, code summarization, and code generation.
arXiv Detail & Related papers (2024-04-13T09:47:07Z)
- Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model [23.947178895479464]
We propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST).
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens. A generic sketch of extracting such AST edges and node types appears after this list.
arXiv Detail & Related papers (2021-08-10T10:08:21Z)
- Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
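As flagged in the CLSEBERT entry above, the snippet below shows one generic way to derive AST-based supervision (edges between AST nodes and node types) from Python source using the standard ast module. The extraction format is an assumption for illustration only, not CLSEBERT's actual pipeline.

import ast

def ast_edges_and_node_types(source: str):
    """Return (parent_index, child_index) edges and per-node type names
    that could supervise edge-prediction and node-type objectives."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    index = {id(node): i for i, node in enumerate(nodes)}
    node_types = [type(node).__name__ for node in nodes]
    edges = [(index[id(parent)], index[id(child)])
             for parent in nodes
             for child in ast.iter_child_nodes(parent)]
    return edges, node_types

edges, node_types = ast_edges_and_node_types("def add(a, b):\n    return a + b")
# node_types includes, e.g., 'Module', 'FunctionDef', 'Return', 'BinOp', ...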