CommitBART: A Large Pre-trained Model for GitHub Commits
- URL: http://arxiv.org/abs/2208.08100v1
- Date: Wed, 17 Aug 2022 06:35:57 GMT
- Title: CommitBART: A Large Pre-trained Model for GitHub Commits
- Authors: Shangqing Liu and Yanzhou Li and Yang Liu
- Abstract summary: We present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits.
The model is pre-trained with six tasks in three categories (denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations.
Experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code.
- Score: 8.783518592487248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GitHub commits, which record the code changes with natural language messages
for description, play a critical role for software developers to comprehend the
software evolution. To promote the development of the open-source software
community, we collect a commit benchmark including over 7.99 million commits
across 7 programming languages. Based on this benchmark, we present CommitBART,
a large pre-trained encoder-decoder Transformer model for GitHub commits. The
model is pre-trained by three categories (i.e., denoising objectives,
cross-modal generation and contrastive learning) for six pre-training tasks to
learn commit fragment representations. Furthermore, we unify a "commit
intelligence" framework with one understanding task and three generation tasks
for commits. The comprehensive experiments on these tasks demonstrate that
CommitBART significantly outperforms previous pre-trained works for code.
Further analysis also reveals each pre-training task enhances the model
performance. We encourage the follow-up researchers to contribute more
commit-related downstream tasks to our framework in the future.
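The abstract names contrastive learning as one pre-training category but does not give its formulation. As a rough illustration only, the sketch below uses the widely used InfoNCE loss with in-batch negatives; the choice of InfoNCE, the pairing of a code-change embedding with its message embedding, and all names (`cosine`, `info_nce`, `diff_vecs`, `msg_vecs`) are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the match for anchors[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: two code changes and their two messages.
diff_vecs = [[1.0, 0.0], [0.0, 1.0]]
msg_vecs = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(diff_vecs, msg_vecs)         # each change paired with its own message
shuffled = info_nce(diff_vecs, msg_vecs[::-1])  # pairs deliberately mismatched
```

With these toy vectors the correctly paired batch yields a much smaller loss than the mismatched one, which is the signal a contrastive objective uses to pull matching representations together.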
Related papers
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [106.11371409170818]
Large language models (LLMs) can act as agents with capabilities to self-refine and improve generated code autonomously.
We propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process.
Specifically, we adopted a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions.
arXiv Detail & Related papers (2024-11-07T00:09:54Z)
- Enhancing MBSE Education with Version Control and Automated Feedback [0.10499611180329801]
This paper presents an innovative approach to conducting a Model-Based Systems Engineering (MBSE) course, engaging over 80 participants annually.
The course is structured around collaborative group assignments, where students utilize Enterprise Architect to complete complex systems engineering tasks across six submissions.
This year, we introduced several technological advancements to enhance the learning experience, including the use of LemonTree, SmartGit, and GitHub.
arXiv Detail & Related papers (2024-09-04T08:12:57Z)
- Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs).
In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z)
- RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models [49.23588578549434]
Large Language Models (LLMs) have improved the prospects for robotic tasks.
Existing benchmarks are still limited to single tasks with limited generalization capabilities.
We introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder.
arXiv Detail & Related papers (2024-06-06T05:41:47Z)
- Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models [13.605167159285374]
Commit message generation is a challenging task in automated software engineering.
The proposed tool is a novel paradigm that introduces the correlation between commits and issues into the training phase of models.
The results show that compared with the original models, the performance of tool-enhanced models is significantly improved.
arXiv Detail & Related papers (2023-07-31T20:35:00Z)
- TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform an AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- Unsupervised Learning of General-Purpose Embeddings for Code Changes [6.652641137999891]
We propose an approach for obtaining embeddings of code changes during pre-training.
We evaluate them on two different downstream tasks - applying changes to code and commit message generation.
Our model outperforms the model that uses full edit sequences by 5.9 percentage points in accuracy.
arXiv Detail & Related papers (2021-06-03T19:08:53Z)
- CoreGen: Contextualized Code Representation Learning for Commit Message Generation [39.383390029545865]
We propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen).
Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score.
arXiv Detail & Related papers (2020-07-14T09:43:26Z)