Just-In-Time Software Defect Prediction via Bi-modal Change Representation Learning
- URL: http://arxiv.org/abs/2410.12107v1
- Date: Tue, 15 Oct 2024 23:13:29 GMT
- Title: Just-In-Time Software Defect Prediction via Bi-modal Change Representation Learning
- Authors: Yuze Jiang, Beijun Shen, Xiaodong Gu
- Abstract summary: We introduce a novel bi-modal change pre-training model called BiCC-BERT.
BiCC-BERT is pre-trained on a code change corpus to learn bi-modal semantic representations.
We train JIT-BiCC using 27,391 code changes and compare its performance with 8 state-of-the-art JIT-DP approaches.
- Score: 5.04327119462716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For predicting software defects at an early stage, researchers have proposed just-in-time defect prediction (JIT-DP) to identify potential defects in code commits. The prevailing approaches train models to represent code changes in history commits and utilize the learned representations to predict the presence of defects in the latest commit. However, existing models merely learn editions in source code, without considering the natural language intentions behind the changes. This limitation hinders their ability to capture deeper semantics. To address this, we introduce a novel bi-modal change pre-training model called BiCC-BERT. BiCC-BERT is pre-trained on a code change corpus to learn bi-modal semantic representations. To incorporate commit messages from the corpus, we design a novel pre-training objective called Replaced Message Identification (RMI), which learns the semantic association between commit messages and code changes. Subsequently, we integrate BiCC-BERT into JIT-DP and propose a new defect prediction approach -- JIT-BiCC. By leveraging the bi-modal representations from BiCC-BERT, JIT-BiCC captures more profound change semantics. We train JIT-BiCC using 27,391 code changes and compare its performance with 8 state-of-the-art JIT-DP approaches. The results demonstrate that JIT-BiCC outperforms all baselines, achieving a 10.8% improvement in F1-score. This highlights its effectiveness in learning the bi-modal semantics for JIT-DP.
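The Replaced Message Identification (RMI) objective described in the abstract pairs each code change with either its original commit message or one sampled from another commit, and trains the encoder to tell the two apart. Below is a minimal PyTorch sketch of how such an objective could be wired up; the head, the example-building helper, and the bi-modal encoder referenced in the comments are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of an RMI-style pre-training objective (illustrative, not BiCC-BERT's code).
import random
import torch
import torch.nn as nn

class RMIHead(nn.Module):
    """Binary head over the pooled encoder output: was the commit message replaced?"""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.classifier(pooled)

def make_rmi_example(code_change: str, message: str, message_pool: list, p_replace: float = 0.5):
    """With probability p_replace, swap in a message drawn from another commit (label 1)."""
    if random.random() < p_replace:
        return code_change, random.choice(message_pool), 1
    return code_change, message, 0

# Training-step outline (the bi-modal encoder is any BERT-style model that accepts a
# (message, code_change) pair; 'encoder', 'tokenizer', and 'rmi_head' are hypothetical here):
#   inputs = tokenizer(message, code_change, truncation=True, return_tensors="pt")
#   pooled = encoder(**inputs).pooler_output
#   loss = nn.functional.cross_entropy(rmi_head(pooled), labels)
```

The replacement probability and the sampling pool are design choices; the sketch only mirrors the binary replaced-vs-original decision the abstract describes.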
Related papers
- Using a Sledgehammer to Crack a Nut? Revisiting Automated Compiler Fault Isolation [9.699231545580806]
This study aims to directly compare a BIC-based strategy, Basic, with representative SBFL techniques in the context of compiler fault localization.
Basic identifies the most recent good release and earliest bad release, and then employs a binary search to pinpoint the bug-inducing commit.
We rigorously compare Basic against SBFL-based techniques using a benchmark consisting of 60 GCC bugs and 60 LLVM bugs.
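The release-bounded bisection that Basic performs can be pictured as a plain binary search over the ordered commit history between the last good and first bad release. The sketch below is an illustrative reconstruction under that reading; `is_bad` and the commit list are hypothetical stand-ins for a real build-and-test harness.

```python
# Illustrative bisection over an ordered commit range (oldest assumed good, newest assumed bad).
from typing import Callable, List

def bisect_bic(commits: List[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first commit at which the bug reproduces,
    i.e. the bug-inducing commit (BIC)."""
    lo, hi = 0, len(commits) - 1          # assumption: commits[lo] is good, commits[hi] is bad
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                      # bug already present: BIC is at or before mid
        else:
            lo = mid                      # still good: BIC is after mid
    return commits[hi]
```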
arXiv Detail & Related papers (2025-12-18T09:22:57Z) - PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise [60.63315470285562]
MiniTruePrefixes is a novel specialized model that better detects factual inconsistencies over text prefixes.
We show that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization.
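One way to picture how a prefix-level inconsistency detector could be folded into controlled decoding is to re-rank candidate continuations by a combined language-model and faithfulness score. This is only a hedged sketch of that idea; `nli_score`, the mixing weight, and the re-ranking rule are assumptions, not the paper's decoding framework.

```python
# Hedged sketch: re-rank candidate next tokens by LM score plus a prefix-faithfulness score.
def rerank_step(candidates, source, prefix, nli_score, alpha=0.5):
    """candidates: list of (token, lm_logprob); nli_score is a hypothetical callable
    returning how well (prefix + token) stays consistent with the source."""
    scored = []
    for token, lm_logprob in candidates:
        faithfulness = nli_score(source, prefix + token)
        scored.append((token, lm_logprob + alpha * faithfulness))
    return max(scored, key=lambda kv: kv[1])[0]   # pick the best-scoring continuation
```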
arXiv Detail & Related papers (2025-11-03T09:07:44Z) - LLMBisect: Breaking Barriers in Bug Bisection with A Comparative Analysis Pipeline [35.18683484280968]
Large Language Models (LLMs) are well-positioned to break the barriers of existing solutions.
LLMs comprehend both textual data and code in patches and commits.
Our approach achieves significantly better accuracy than the state-of-the-art solution by more than 38%.
arXiv Detail & Related papers (2025-10-30T02:47:25Z) - LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets [5.191767648600372]
We investigate the utility of Large Language Models for detecting tangled code changes by leveraging both commit messages and method-level code diffs.
Our results demonstrate that combining commit messages with code diffs significantly enhances model performance.
Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods.
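A hedged sketch of how a commit message and its method-level diffs might be combined into a single classification prompt is shown below; the prompt wording, labels, and the `llm_client` call in the usage note are illustrative assumptions rather than the paper's exact setup.

```python
# Illustrative prompt construction for tangled-change detection (not the paper's prompt).
def build_tangled_change_prompt(commit_message: str, method_diffs: list) -> str:
    diffs = "\n\n".join(method_diffs)
    return (
        "You are reviewing a single commit.\n"
        f"Commit message:\n{commit_message}\n\n"
        f"Method-level diffs:\n{diffs}\n\n"
        "Question: does this commit mix unrelated concerns (a tangled change)? "
        "Answer 'tangled' or 'untangled' and briefly justify."
    )

# Usage (any chat-completion style API would do; llm_client is hypothetical):
#   prompt = build_tangled_change_prompt(msg, diffs)
#   label = llm_client.complete(prompt)
```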
arXiv Detail & Related papers (2025-05-13T06:26:13Z) - Identifying Bug Inducing Commits by Combining Fault Localisation and Code Change Histories [10.027862394831669]
A Bug Inducing Commit (BIC) is a code change that introduces a bug into the software.
We propose a technique called Fonte that aims to identify the BIC, based on the core idea that a commit is more likely to be the BIC if it has more recently modified suspicious code elements.
Fonte significantly outperforms state-of-the-art BIC identification techniques, achieving up to 45.8% higher MRR.
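Read that way, the ranking can be approximated by letting suspicious code elements (from fault localisation) vote for the commits that most recently modified them, with older modifications decayed. The following is a simplified illustration of that recency-weighted voting, not the Fonte implementation.

```python
# Illustrative recency-weighted voting over commits; data structures are assumptions.
from collections import defaultdict
from typing import Dict, List, Tuple

def score_commits(
    suspicious: Dict[str, float],        # code element -> suspiciousness from fault localisation
    history: Dict[str, List[str]],       # code element -> commits that touched it, oldest..newest
    decay: float = 0.5,
) -> List[Tuple[str, float]]:
    votes: Dict[str, float] = defaultdict(float)
    for element, susp in suspicious.items():
        # more recent modifications of a suspicious element get exponentially more weight
        for age, commit in enumerate(reversed(history.get(element, []))):
            votes[commit] += susp * (decay ** age)
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
```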
arXiv Detail & Related papers (2025-02-18T15:02:22Z) - Supporting Cross-language Cross-project Bug Localization Using Pre-trained Language Models [2.5121668584771837]
Existing techniques often struggle with generalizability and deployment due to their reliance on application-specific data.
This paper proposes a novel pre-trained language model (PLM) based technique for bug localization that transcends project and language boundaries.
arXiv Detail & Related papers (2024-07-03T01:09:36Z) - CCBERT: Self-Supervised Code Change Representation Learning [14.097775709587475]
CCBERT is a new Transformer-based pre-trained model that learns a generic representation of code changes based on a large-scale dataset containing massive unlabeled code changes.
Our experiments demonstrate that CCBERT significantly outperforms CC2Vec or the state-of-the-art approaches of the downstream tasks by 7.7%--14.0% in terms of different metrics and tasks.
arXiv Detail & Related papers (2023-09-27T08:17:03Z) - Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization [4.159296619915587]
We propose a novel directed, multiple-label code graph representation named Semantic Flow Graph (SFG).
We show that our method achieves state-of-the-art performance in bug localization.
arXiv Detail & Related papers (2023-08-24T13:25:17Z) - Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
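The two-encoder, CTC-guided layout described above can be sketched roughly as a stack in which the first encoder is supervised with CTC on the source transcript and the second, reading the first's states, with CTC on the target text. Module sizes and names below are assumptions, not the NAST architecture.

```python
# Rough sketch of a two-encoder model trained with two CTC losses (illustrative only).
import torch
import torch.nn as nn

class TwoEncoderCTC(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.src_encoder = nn.TransformerEncoder(layer, num_layers=4)  # supervised on source text
        self.tgt_encoder = nn.TransformerEncoder(layer, num_layers=4)  # supervised on target text
        self.src_head = nn.Linear(d_model, src_vocab)
        self.tgt_head = nn.Linear(d_model, tgt_vocab)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, speech, in_lens, src_ids, src_lens, tgt_ids, tgt_lens):
        h = self.src_encoder(self.proj(speech))                        # (B, T, d_model)
        g = self.tgt_encoder(h)
        src_logp = self.src_head(h).log_softmax(-1).transpose(0, 1)    # (T, B, V) for CTCLoss
        tgt_logp = self.tgt_head(g).log_softmax(-1).transpose(0, 1)
        return self.ctc(src_logp, src_ids, in_lens, src_lens) + \
               self.ctc(tgt_logp, tgt_ids, in_lens, tgt_lens)
```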
arXiv Detail & Related papers (2023-05-27T03:54:09Z) - TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z) - Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and more complex, to the point that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z) - Autoregressive Belief Propagation for Decoding Block Codes [113.38181979662288]
We revisit recent methods that employ graph neural networks for decoding error correcting codes.
Our method violates the symmetry conditions that enable the other methods to train exclusively with the zero-word.
Despite not having the luxury of training on a single word, and the inability to train on more than a small fraction of the relevant sample space, we demonstrate effective training.
arXiv Detail & Related papers (2021-01-23T17:14:55Z) - CoreGen: Contextualized Code Representation Learning for Commit Message Generation [39.383390029545865]
We propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen).
Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score.
arXiv Detail & Related papers (2020-07-14T09:43:26Z) - MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
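The ELECTRA-style mechanism summarised above reduces, at the loss level, to a per-token binary decision: was this token replaced by the generator? A minimal sketch of that replaced-token-detection loss follows; the tensor shapes and the surrounding generator/discriminator modules are assumed, not taken from MC-BERT.

```python
# Minimal sketch of an ELECTRA-style replaced-token-detection loss (illustrative only).
import torch
import torch.nn as nn

def replaced_token_detection_loss(disc_logits: torch.Tensor,
                                  original_ids: torch.Tensor,
                                  corrupted_ids: torch.Tensor) -> torch.Tensor:
    """disc_logits: (B, T) per-token scores from the discriminator;
    corrupted_ids: the input after masked positions were re-sampled by a generator."""
    labels = (corrupted_ids != original_ids).float()   # 1 where the generator changed the token
    return nn.functional.binary_cross_entropy_with_logits(disc_logits, labels)
```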
arXiv Detail & Related papers (2020-06-10T09:22:19Z)