CommitBART: A Large Pre-trained Model for GitHub Commits
- URL: http://arxiv.org/abs/2208.08100v1
- Date: Wed, 17 Aug 2022 06:35:57 GMT
- Title: CommitBART: A Large Pre-trained Model for GitHub Commits
- Authors: Shangqing Liu and Yanzhou Li and Yang Liu
- Abstract summary: We present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits.
The model is pre-trained with six pre-training tasks spanning three categories (i.e., denoising objectives, cross-modal generation and contrastive learning) to learn commit fragment representations.
Experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code.
- Score: 8.783518592487248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GitHub commits, which record code changes together with natural language messages describing them, play a critical role in helping software developers comprehend software evolution. To promote the development of the open-source software community, we collect a commit benchmark including over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks spanning three categories (i.e., denoising objectives, cross-modal generation and contrastive learning) to learn commit fragment representations. Furthermore, we unify one understanding task and three generation tasks for commits into a "commit intelligence" framework. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances model performance. We encourage follow-up researchers to contribute more commit-related downstream tasks to our framework.
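As a rough illustration of how such pre-training inputs over commits might be constructed (the exact task formulations are defined in the paper; the field names, the `<mask>` sentinel, and the functions below are illustrative assumptions, and the contrastive objective is omitted), a minimal sketch:

```python
import random

# A toy commit fragment: a unified-diff hunk plus its natural language message.
# The field names and the "<mask>" sentinel are illustrative assumptions, not
# the paper's exact formulation.
commit = {
    "diff": "- return a+b\n+ return a + b  # fix spacing",
    "message": "fix spacing in add()",
}

def make_denoising_example(text: str, mask_ratio: float = 0.15, seed: int = 0):
    """Corrupt a token sequence by replacing random tokens with <mask>;
    the model would be trained to reconstruct the original sequence."""
    rng = random.Random(seed)
    tokens = text.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = ["<mask>" if i in positions else tok for i, tok in enumerate(tokens)]
    return {"source": " ".join(corrupted), "target": " ".join(tokens)}

def make_cross_modal_example(commit: dict):
    """Cross-modal generation: condition on the code change, generate the message."""
    return {"source": commit["diff"], "target": commit["message"]}

print(make_denoising_example(commit["diff"]))
print(make_cross_modal_example(commit))
```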
Related papers
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [91.15135237584771]
Large language models (LLMs) can act as agents with capabilities to self-refine and improve generated code autonomously.
We propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process.
Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine them.
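As a hedged sketch of what an explicit tree over coding strategies and candidate solutions could look like (the node fields, scoring, and refinement hook below are assumptions for illustration, not CodeTree's actual interfaces):

```python
from dataclasses import dataclass, field

@dataclass
class CodeNode:
    """One node in a search tree over candidate solutions: a strategy
    description, the code it produced, a quality score (e.g. from tests
    or an LLM critic), and refined children."""
    strategy: str
    code: str
    score: float = 0.0
    children: list["CodeNode"] = field(default_factory=list)

    def expand(self, refinements: list[tuple[str, str, float]]) -> None:
        # Each refinement is (strategy, refined_code, score); in the real
        # framework these would come from LLM agents, not a fixed list.
        for strategy, code, score in refinements:
            self.children.append(CodeNode(strategy, code, score))

    def best_leaf(self) -> "CodeNode":
        # Greedily follow the highest-scoring child until reaching a leaf.
        node = self
        while node.children:
            node = max(node.children, key=lambda c: c.score)
        return node
```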
arXiv Detail & Related papers (2024-11-07T00:09:54Z) - Enhancing MBSE Education with Version Control and Automated Feedback [0.10499611180329801]
This paper presents an innovative approach to conducting a Model-Based Systems Engineering (MBSE) course, engaging over 80 participants annually.
The course is structured around collaborative group assignments, where students utilize Enterprise Architect to complete complex systems engineering tasks across six submissions.
This year, we introduced several technological advancements to enhance the learning experience, including the use of LemonTree, SmartGit, and GitHub.
arXiv Detail & Related papers (2024-09-04T08:12:57Z) - Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs).
In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z) - RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models [49.23588578549434]
Large Language Models (LLMs) have improved the prospects for robotic tasks.
However, existing benchmarks are still restricted to single tasks with limited generalization capabilities.
We introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder.
arXiv Detail & Related papers (2024-06-06T05:41:47Z) - Delving into Commit-Issue Correlation to Enhance Commit Message
Generation Models [13.605167159285374]
Commit message generation is a challenging task in automated software engineering.
The proposed tool introduces a novel paradigm that incorporates the correlation between commits and issues into the training phase of models.
The results show that tool-enhanced models significantly outperform the original models.
arXiv Detail & Related papers (2023-07-31T20:35:00Z) - TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z) - MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are
Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
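A rough PyTorch sketch of the shared-encoder, multi-decoder bottleneck idea (dimensions, number of decoders, the shallow linear decoder heads, and the use of the first token as the bottleneck vector are assumptions for illustration, not MASTER's actual architecture):

```python
import torch
import torch.nn as nn

class BottleneckedMultiDecoder(nn.Module):
    """One shared encoder; its output is squeezed into a single dense
    vector that each task-specific decoder must decode from, forcing the
    semantics of every pre-training task through the bottleneck."""
    def __init__(self, vocab_size: int = 1000, d_model: int = 256, n_tasks: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One shallow decoder head per pre-training task, all reading the
        # same bottleneck vector.
        self.decoders = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_tasks)
        )

    def forward(self, input_ids: torch.Tensor):
        hidden = self.encoder(self.embed(input_ids))  # (batch, seq, d_model)
        bottleneck = hidden[:, 0]                     # (batch, d_model) dense vector
        return [decoder(bottleneck) for decoder in self.decoders]

logits_per_task = BottleneckedMultiDecoder()(torch.randint(0, 1000, (2, 16)))
```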
arXiv Detail & Related papers (2022-12-15T13:57:07Z) - Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
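A minimal sketch of the task-prefix idea, where each training example is tagged with a task-identifying prefix so a single multi-task model can condition on, and later be probed about, task identity (the prefix strings and example inputs are assumptions for illustration):

```python
# Prepend a short task prefix to each example; the prefix tokens become the
# handle for analyzing how tasks relate to one another.
TASK_PREFIXES = {
    "nli": "[NLI]",
    "qa": "[QA]",
    "summarization": "[SUM]",
}

def with_task_prefix(task: str, text: str) -> str:
    return f"{TASK_PREFIXES[task]} {text}"

batch = [
    with_task_prefix("qa", "Question: What does the patch change? Context: ..."),
    with_task_prefix("summarization", "Summarize: The patch refactors the parser ..."),
]
print(batch)
```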
arXiv Detail & Related papers (2022-10-12T15:02:04Z) - UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform an AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
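As an illustrative analogue of mapping an AST into a sequence that still preserves the tree structure (this uses Python's ast module and bracket-style open/close tokens as an assumption; UniXcoder defines its own one-to-one mapping scheme):

```python
import ast

def flatten_ast(node: ast.AST) -> list[str]:
    """Serialize an AST depth-first, wrapping each node's children in
    explicit open/close tokens so the original tree can be recovered."""
    tokens = [f"<{type(node).__name__}>"]
    for child in ast.iter_child_nodes(node):
        tokens.extend(flatten_ast(child))
    tokens.append(f"</{type(node).__name__}>")
    return tokens

tree = ast.parse("def add(a, b):\n    return a + b")
print(" ".join(flatten_ast(tree)))
```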
arXiv Detail & Related papers (2022-03-08T04:48:07Z) - Unsupervised Learning of General-Purpose Embeddings for Code Changes [6.652641137999891]
We propose an approach for obtaining embeddings of code changes during pre-training.
We evaluate them on two different downstream tasks - applying changes to code and commit message generation.
Our model outperforms the model that uses full edit sequences by 5.9 percentage points in accuracy.
arXiv Detail & Related papers (2021-06-03T19:08:53Z) - CoreGen: Contextualized Code Representation Learning for Commit Message
Generation [39.383390029545865]
We propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen).
Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score.
arXiv Detail & Related papers (2020-07-14T09:43:26Z)