CCBERT: Self-Supervised Code Change Representation Learning
- URL: http://arxiv.org/abs/2309.15474v1
- Date: Wed, 27 Sep 2023 08:17:03 GMT
- Title: CCBERT: Self-Supervised Code Change Representation Learning
- Authors: Xin Zhou, Bowen Xu, DongGyun Han, Zhou Yang, Junda He and David Lo
- Abstract summary: CCBERT is a new Transformer-based pre-trained model that learns a generic representation of code changes based on a large-scale dataset containing massive unlabeled code changes.
Our experiments demonstrate that CCBERT significantly outperforms CC2Vec and other state-of-the-art approaches on the downstream tasks by 7.7%--14.0% across different metrics and tasks.
- Score: 14.097775709587475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous code changes are made by developers in their daily work, and a
superior representation of code changes is desired for effective code change
analysis. Recently, Hoang et al. proposed CC2Vec, a neural network-based
approach that learns a distributed representation of code changes to capture
the semantic intent of the changes. Despite demonstrated effectiveness in
multiple tasks, CC2Vec has several limitations: 1) it considers only
coarse-grained information about code changes, and 2) it relies on log messages
rather than the self-contained content of the code changes. In this work, we
propose CCBERT (Code Change BERT), a new
Transformer-based pre-trained model that learns a generic representation of
code changes based on a large-scale dataset containing massive unlabeled code
changes. CCBERT is pre-trained on four proposed self-supervised objectives that
are specialized for learning code change representations based on the contents
of code changes. CCBERT perceives fine-grained code changes at the token level
by learning from the old and new versions of the content, along with the edit
actions. Our experiments demonstrate that CCBERT significantly outperforms
CC2Vec and other state-of-the-art approaches on the downstream tasks by
7.7%--14.0% across different metrics and tasks. CCBERT consistently
outperforms large pre-trained code models, such as CodeBERT, while requiring
6--10x less training time, 5--30x less inference time, and 7.9x less GPU
memory.
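The abstract does not spell out CCBERT's input encoding, but token-level edit actions over an old/new pair can be derived by sequence alignment. A minimal sketch, assuming a difflib-based alignment rather than CCBERT's actual tokenizer and action vocabulary:

```python
import difflib

def edit_actions(old_tokens, new_tokens):
    """Label every token of the old and new versions with an edit action
    (equal / replace / delete / insert) derived from sequence alignment."""
    ops = difflib.SequenceMatcher(a=old_tokens, b=new_tokens).get_opcodes()
    labeled = []
    for tag, i1, i2, j1, j2 in ops:
        if tag in ("equal", "replace", "delete"):
            labeled += [("old", tok, tag) for tok in old_tokens[i1:i2]]
        if tag in ("equal", "replace", "insert"):
            labeled += [("new", tok, tag) for tok in new_tokens[j1:j2]]
    return labeled

print(edit_actions("x = a + b".split(), "x = a - b".split()))
# ... ('old', '+', 'replace'), ('new', '-', 'replace') ...
```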
Related papers
- ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution [16.130469984234956]
ChangeGuard is an approach that uses learning-guided execution to compare the runtime behavior of the original and the modified version of a function.
Our results show that the approach identifies semantics-changing code changes with a precision of 77.1% and a recall of 69.5%.
arXiv Detail & Related papers (2024-10-21T15:13:32Z)
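ChangeGuard's learning-guided execution is far more involved than this, but the comparison it ultimately makes can be pictured with a purely illustrative pairwise check (function pair and inputs are invented for the example):

```python
def behavior_differs(old_fn, new_fn, inputs):
    """Toy pairwise check: run both versions on the same inputs and report
    the first input whose observable outcome (value or exception) diverges."""
    for args in inputs:
        outcomes = []
        for fn in (old_fn, new_fn):
            try:
                outcomes.append(("ok", fn(*args)))
            except Exception as e:
                outcomes.append(("raised", type(e).__name__))
        if outcomes[0] != outcomes[1]:
            return args, outcomes[0], outcomes[1]
    return None  # behaviorally equal on the sampled inputs

# A "refactoring" that silently changes behavior on negative inputs:
old = lambda x: abs(x) + 1
new = lambda x: x + 1
print(behavior_differs(old, new, [(0,), (5,), (-3,)]))  # ((-3,), ('ok', 4), ('ok', -2))
```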
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
We introduce Positional Integrity Encoding (PIE).
PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
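The summary above gives only the headline number; as the name suggests, the underlying trick is to repair the positional information of cached states after an edit instead of recomputing everything. A sketch of that idea for rotate-half rotary embeddings (the RoPE layout is an assumption about the setup, not the paper's exact formulation):

```python
import numpy as np

def shift_rope_phase(cached_keys, delta, base=10000.0):
    """Rotate already-encoded keys by `delta` positions so tokens displaced
    by an edit regain correct rotary phases without a new forward pass.
    Assumes the rotate-half RoPE layout: dim i pairs with dim i + d/2."""
    _, d = cached_keys.shape
    inv_freq = base ** (-np.arange(d // 2) / (d // 2))  # per-pair frequency
    angle = delta * inv_freq                            # phase correction
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = cached_keys[:, : d // 2], cached_keys[:, d // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# After inserting 3 tokens at position p, keys cached for positions >= p
# can be phase-shifted with delta=3 instead of being recomputed.
```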
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
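TransformCode's exact loss is not quoted above; as a generic illustration of the contrastive objective such frameworks build on, an NT-Xent-style loss over paired embeddings (original snippet vs. its transformed variant) might look like:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent-style loss: row i of z1 and row i of z2 embed two views
    (original / transformed) of the same snippet; other rows are negatives."""
    z = np.concatenate([z1, z2])                          # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                        # mask self-similarity
    n = len(z1)
    target = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), target].mean()

rng = np.random.default_rng(0)
print(nt_xent(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))))
```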
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
Memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
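How LongCoder places bridge and memory tokens is learned from the paper itself; the shape of the resulting sparse attention, a causal sliding window plus globally accessible positions, can be sketched as a boolean mask (window size and global positions here are arbitrary):

```python
import numpy as np

def longcoder_style_mask(seq_len, window, global_positions):
    """Causal sparse attention mask: True means query i may attend to key j.
    Each token sees its `window` most recent predecessors plus every earlier
    globally accessible token (standing in for bridge / memory tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    local = causal & (i - j <= window)
    glob = np.zeros(seq_len, dtype=bool)
    glob[list(global_positions)] = True
    return local | (causal & glob[None, :])

print(longcoder_style_mask(8, window=2, global_positions=[0, 4]).astype(int))
```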
- Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing [57.776971051512234]
In this work, we explore a multi-round code auto-editing setting, aiming to predict edits to a code region based on recent changes within the same codebase.
Our model, Coeditor, is a fine-tuned language model specifically designed for code editing tasks.
In a simplified single-round, single-edit task, Coeditor significantly outperforms GPT-3.5 and SOTA open-source code completion models.
arXiv Detail & Related papers (2023-05-29T19:57:36Z)
- CCT5: A Code-Change-Oriented Pre-Trained Model [14.225942520238936]
We propose to pre-train a model specially designed for code changes to better support developers in software maintenance.
We first collect a large-scale dataset containing 1.5M+ pairwise data of code changes and commit messages.
We fine-tune the pre-trained model, CCT5, on three widely studied tasks incurred by code changes and two tasks specific to the code review process.
arXiv Detail & Related papers (2023-05-18T07:55:37Z)
- Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
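Telly's contribution is deciding which layers to freeze via its probing study; the freezing itself is mechanical. A sketch against a RoBERTa-style encoder such as the public microsoft/codebert-base checkpoint (the choice of k is illustrative, not Telly's prescription):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")  # RoBERTa-style encoder

def freeze_bottom_layers(model, k):
    """Freeze the embeddings and the first k encoder layers; only the upper
    layers (and any task head added later) receive gradient updates."""
    for p in model.embeddings.parameters():
        p.requires_grad = False
    for layer in model.encoder.layer[:k]:
        for p in layer.parameters():
            p.requires_grad = False

freeze_bottom_layers(model, k=8)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```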
- CCRep: Learning Code Change Representations via Pre-Trained Code Model and Query Back [8.721077261941236]
This work proposes a novel Code Change Representation learning approach named CCRep.
CCRep learns to encode code changes as feature vectors for diverse downstream tasks.
We apply CCRep to three tasks: commit message generation, patch correctness assessment, and just-in-time defect prediction.
arXiv Detail & Related papers (2023-02-08T07:43:55Z)
- Contrastive Learning for Source Code with Structural and Functional Properties [66.10710134948478]
We present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code that looks drastically different from the original one.
We train our model with a contrastive learning objective that pulls functionally equivalent code closer together and pushes distinct code further apart.
arXiv Detail & Related papers (2021-10-08T02:56:43Z)
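BOOST's structure-guided transformations go well beyond renaming; as the simplest example of producing a functionally equivalent but textually different variant, a toy identifier-renaming pass over a Python AST (this transform is illustrative, not BOOST's algorithm):

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Toy semantics-preserving transform: consistently rename variables so
    the code looks different while computing the same function."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):  # rename function parameters too
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

src = "def f(a, b):\n    total = a + b\n    return total * 2\n"
tree = RenameIdentifiers({"a": "x", "b": "y", "total": "acc"}).visit(ast.parse(src))
print(ast.unparse(tree))  # def f(x, y): acc = x + y; return acc * 2
```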
- Unsupervised Learning of General-Purpose Embeddings for Code Changes [6.652641137999891]
We propose an approach for obtaining embeddings of code changes during pre-training.
We evaluate them on two different downstream tasks - applying changes to code and commit message generation.
Our model outperforms the model that uses full edit sequences by 5.9 percentage points in accuracy.
arXiv Detail & Related papers (2021-06-03T19:08:53Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
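GraphCodeBERT extracts its data flow from tree-sitter parses across several languages; for intuition, the "where-the-value-comes-from" relation can be approximated for straight-line Python with a toy def-use pass (no branches, loops, or nested scopes handled):

```python
import ast

def value_comes_from(src):
    """Toy 'where-the-value-comes-from' edges for straight-line code: each
    variable read is linked to the line of the assignment that last wrote it."""
    last_def, edges = {}, []
    for stmt in ast.parse(src).body:
        if isinstance(stmt, ast.Assign):
            for node in ast.walk(stmt.value):          # reads on the RHS
                if isinstance(node, ast.Name):
                    edges.append((node.id, stmt.lineno, last_def.get(node.id)))
            for tgt in stmt.targets:                   # then record the write
                if isinstance(tgt, ast.Name):
                    last_def[tgt.id] = stmt.lineno
    return edges

print(value_comes_from("a = 1\nb = a + 2\na = b * a\n"))
# [('a', 2, 1), ('b', 3, 2), ('a', 3, 1)]
```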
This list is automatically generated from the titles and abstracts of the papers on this site.