Unsupervised Learning of General-Purpose Embeddings for Code Changes
- URL: http://arxiv.org/abs/2106.02087v1
- Date: Thu, 3 Jun 2021 19:08:53 GMT
- Title: Unsupervised Learning of General-Purpose Embeddings for Code Changes
- Authors: Mikhail Pravilov, Egor Bogomolov, Yaroslav Golubev, Timofey Bryksin
- Abstract summary: We propose an approach for obtaining embeddings of code changes during pre-training.
We evaluate them on two different downstream tasks - applying changes to code and commit message generation.
Our model outperforms the model that uses full edit sequences by 5.9 percentage points in accuracy.
- Score: 6.652641137999891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many problems in software engineering, such as bug fixing and
commit message generation, require analyzing not only the code itself but
the code changes specifically. Applying machine learning models to these tasks
requires us to create numerical representations of the changes, i.e.
embeddings. Recent studies demonstrate that the best way to obtain these
embeddings is to pre-train a deep neural network in an unsupervised manner on a
large volume of unlabeled data and then further fine-tune it for a specific
task.
In this work, we propose an approach for obtaining such embeddings of code
changes during pre-training and evaluate them on two different downstream tasks
- applying changes to code and commit message generation. The pre-training
consists of the model learning to apply the given change (an edit sequence) to
the code in a correct way, and therefore requires only the code change itself.
To increase the quality of the obtained embeddings, we only consider the
changed tokens in the edit sequence. In the task of applying code changes, our
model outperforms the model that uses full edit sequences by 5.9 percentage
points in accuracy. As for the commit message generation, our model
demonstrated the same results as supervised models trained for this specific
task, which indicates that it can encode code changes well and can be improved
in the future by pre-training on a larger dataset of easily gathered code
changes.
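The pre-training objective above can be pictured with a small sketch: align the before and after versions of the code at the token level and keep only the changed tokens as the edit sequence. The tokenization, the alignment via difflib, and the placeholder token below are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (assumption, not the paper's code): derive an edit
# sequence that keeps only the changed tokens from a before/after code pair.
import difflib

def changed_token_edits(before_tokens, after_tokens):
    """Return (old_tokens, new_tokens) pairs for changed positions only."""
    matcher = difflib.SequenceMatcher(a=before_tokens, b=after_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":                            # unchanged tokens are dropped
            continue
        old = before_tokens[i1:i2] or ["<empty>"]    # insertions have no old side
        new = after_tokens[j1:j2] or ["<empty>"]     # deletions have no new side
        edits.append((old, new))
    return edits

before = "int x = a + b ;".split()
after = "int x = a - b ;".split()
print(changed_token_edits(before, after))            # [(['+'], ['-'])]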
Related papers
- CCBERT: Self-Supervised Code Change Representation Learning [14.097775709587475]
CCBERT is a new Transformer-based pre-trained model that learns a generic representation of code changes from a large-scale dataset of unlabeled code changes.
Our experiments demonstrate that CCBERT significantly outperforms CC2Vec and the state-of-the-art approaches on the downstream tasks by 7.7%-14.0% across different metrics and tasks.
arXiv Detail & Related papers (2023-09-27T08:17:03Z)
- Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing [57.776971051512234]
In this work, we explore a multi-round code auto-editing setting, aiming to predict edits to a code region based on recent changes within the same codebase.
Our model, Coeditor, is a fine-tuned language model specifically designed for code editing tasks.
In a simplified single-round, single-edit task, Coeditor significantly outperforms GPT-3.5 and SOTA open-source code completion models.
arXiv Detail & Related papers (2023-05-29T19:57:36Z)
- CCT5: A Code-Change-Oriented Pre-Trained Model [14.225942520238936]
We propose to pre-train a model specially designed for code changes to better support developers in software maintenance.
We first collect a large-scale dataset containing 1.5M+ pairs of code changes and commit messages.
We fine-tune the pre-trained model, CCT5, on three widely studied tasks incurred by code changes and two tasks specific to the code review process.
arXiv Detail & Related papers (2023-05-18T07:55:37Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand by using contextual data improves the performance of pre-trained code language models on the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- CCRep: Learning Code Change Representations via Pre-Trained Code Model and Query Back [8.721077261941236]
This work proposes a novel Code Change Representation learning approach named CCRep.
CCRep learns to encode code changes as feature vectors for diverse downstream tasks.
We apply CCRep to three tasks: commit message generation, patch correctness assessment, and just-in-time defect prediction.
arXiv Detail & Related papers (2023-02-08T07:43:55Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors [53.819805242367345]
We propose GRACE, a lifelong model editing method, which implements spot-fixes on streaming errors of a deployed model.
GRACE writes new mappings into a pre-trained model's latent space, creating a discrete, local codebook of edits without altering model weights; a toy sketch of this codebook lookup follows the list below.
Our experiments on T5, BERT, and GPT models show GRACE's state-of-the-art performance in making and retaining edits, while generalizing to unseen inputs.
arXiv Detail & Related papers (2022-11-20T17:18:22Z)
- CodeEditor: Learning to Edit Source Code with Pre-trained Models [47.736781998792]
This paper presents an effective pre-trained code editing model named CodeEditor.
We collect a large set of real-world code snippets as the ground truth and use a powerful generator to rewrite them into mutated versions.
We conduct experiments on four code editing datasets and evaluate the pre-trained CodeEditor in three settings.
arXiv Detail & Related papers (2022-10-31T03:26:33Z)
- Editing Factual Knowledge in Language Models [51.947280241185]
We present KnowledgeEditor, a method that can be used to edit this knowledge.
Besides being computationally efficient, KnowledgeEditor does not require any modifications in LM pre-training.
We show KnowledgeEditor's efficacy with two popular architectures and knowledge-intensive tasks.
arXiv Detail & Related papers (2021-04-16T15:24:42Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables; a toy sketch of such data-flow edges is given below.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
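To make the "where-the-value-comes-from" relation above concrete, here is a toy sketch of data-flow edge extraction. It uses Python's ast module and line numbers as positions purely for illustration; GraphCodeBERT's actual pipeline covers multiple languages and works over token positions, and the function name here is made up.

# Toy sketch (assumption, not GraphCodeBERT's pipeline): extract
# "where-the-value-comes-from" edges, mapping each variable use to the
# latest assignment it reads from.
import ast

def data_flow_edges(source):
    edges = []                                 # (variable, defined_at_line, used_at_line)
    last_def = {}                              # variable name -> line of latest assignment
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for use in ast.walk(node.value):   # right-hand-side variable uses
                if isinstance(use, ast.Name) and use.id in last_def:
                    edges.append((use.id, last_def[use.id], node.lineno))
            for target in node.targets:        # newly (re)defined variables
                if isinstance(target, ast.Name):
                    last_def[target.id] = node.lineno
    return edges

snippet = "x = 1\ny = x + 2\nz = x + y\n"
print(data_flow_edges(snippet))                # [('x', 1, 2), ('x', 1, 3), ('y', 2, 3)]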
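GRACE's discrete key-value codebook (see the entry above) can likewise be pictured with a small sketch: wrap a frozen layer, cache (activation key, edited output) pairs with a deferral radius, and return the cached output when a query lands inside a radius. The class and parameter names below are illustrative assumptions and numpy stands in for the model's tensors; this is not the released GRACE implementation.

# Toy sketch (assumption): a discrete key-value adaptor that overrides a
# layer's output when the input falls inside a cached edit's deferral radius,
# leaving the frozen layer weights untouched.
import numpy as np

class CodebookAdaptor:
    def __init__(self, layer_fn, epsilon=1.0):
        self.layer_fn = layer_fn           # frozen layer being wrapped
        self.epsilon = epsilon             # deferral radius for cached edits
        self.keys, self.values = [], []    # codebook of (activation, replacement)

    def add_edit(self, key_activation, new_output):
        """Store a spot-fix: when a similar activation appears, emit new_output."""
        self.keys.append(np.asarray(key_activation, dtype=float))
        self.values.append(np.asarray(new_output, dtype=float))

    def __call__(self, activation):
        activation = np.asarray(activation, dtype=float)
        for key, value in zip(self.keys, self.values):
            if np.linalg.norm(activation - key) < self.epsilon:
                return value               # hit: return the edited output
        return self.layer_fn(activation)   # miss: defer to the frozen layer

layer = CodebookAdaptor(lambda h: 2 * h, epsilon=0.5)
layer.add_edit([1.0, 0.0], [0.0, 1.0])
print(layer([1.0, 0.1]))   # inside the radius -> edited output [0. 1.]
print(layer([3.0, 3.0]))   # outside -> frozen layer output [6. 6.]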
This list is automatically generated from the titles and abstracts of the papers on this site.