KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation
- URL: http://arxiv.org/abs/2401.08376v1
- Date: Tue, 16 Jan 2024 14:07:48 GMT
- Title: KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation
- Authors: Wei Tao, Yucheng Zhou, Yanlin Wang, Hongyu Zhang, Haofen Wang,
Wenqiang Zhang
- Abstract summary: We propose a novel knowledge-aware denoising learning method called KADEL.
Considering that good-practice commits constitute only a small proportion of the dataset, we align the remaining training samples with these good-practice commits.
Our method achieves overall state-of-the-art performance compared with previous methods.
- Score: 43.8807366757381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commit messages are natural language descriptions of code changes,
which are important for software evolution tasks such as code understanding and
maintenance. However, previous methods are trained on the entire dataset without
considering the fact that a portion of commit messages adhere to good practice
(i.e., good-practice commits) while the rest do not. Based on our empirical
study, we find that training on good-practice commits contributes significantly
to commit message generation. Motivated by this finding, we propose a novel
knowledge-aware denoising learning method called KADEL. Since good-practice
commits constitute only a small proportion of the dataset, we align the
remaining training samples with these good-practice commits. To achieve this, we
propose a model that learns commit knowledge by training on good-practice
commits; this knowledge model supplements additional information for training
samples that do not conform to good practice. However, because the supplementary
information may contain noise or prediction errors, we propose a dynamic
denoising training method that combines a distribution-aware confidence function
with a dynamic distribution list, enhancing the effectiveness of the training
process. Experimental results on the whole MCMD dataset demonstrate that our
method achieves overall state-of-the-art performance compared with previous
methods. Our source code and data are available at
https://github.com/DeepSoftwareAnalytics/KADEL
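To make the training scheme above more concrete, the sketch below gives one plausible, simplified reading of knowledge-aware denoising in PyTorch-style Python: a knowledge model trained only on good-practice commits provides soft targets for the remaining samples, and a per-sample confidence decides how much of that (possibly noisy) supplementary signal enters the loss. All names and details here (knowledge_model, distribution_aware_confidence, the agreement-based confidence, the loss mixing) are illustrative assumptions, not the authors' implementation; the actual KADEL code is in the repository linked above.

```python
# Illustrative sketch only: a simplified, assumption-based rendering of
# knowledge-aware denoising training. It is NOT the authors' implementation;
# see https://github.com/DeepSoftwareAnalytics/KADEL for the real code.
import torch
import torch.nn.functional as F


def knowledge_soft_targets(knowledge_model, diff_ids):
    """Soft targets from a model trained only on good-practice commits."""
    with torch.no_grad():
        logits = knowledge_model(diff_ids)          # (B, T, V)
    return F.softmax(logits, dim=-1)


def distribution_aware_confidence(student_logits, soft_targets):
    """Hypothetical confidence: agreement between the student's predictive
    distribution and the knowledge model's distribution (1 = identical)."""
    student_probs = F.softmax(student_logits, dim=-1)
    # Total-variation-style agreement per token, averaged over the sequence.
    agreement = 1.0 - 0.5 * (student_probs - soft_targets).abs().sum(dim=-1)
    return agreement.mean(dim=-1)                   # (B,)


def denoising_loss(student_logits, target_ids, soft_targets, is_good_practice):
    """Blend ground-truth supervision with (possibly noisy) knowledge supervision.

    Good-practice samples are trained on their own messages; the others mix in
    the knowledge model's soft targets, weighted by a per-sample confidence.
    """
    vocab = student_logits.size(-1)
    # Per-sample cross-entropy against the original commit message tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, vocab), target_ids.view(-1), reduction="none"
    ).view(target_ids.shape).mean(dim=-1)           # (B,)

    # Per-sample KL divergence towards the knowledge model's distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1), soft_targets, reduction="none"
    ).sum(dim=-1).mean(dim=-1)                      # (B,)

    conf = distribution_aware_confidence(student_logits, soft_targets)  # (B,)
    # Good-practice commits (is_good_practice: bool tensor of shape (B,)):
    # pure cross-entropy. Others: interpolate towards the knowledge
    # distribution according to the confidence score.
    mix = torch.where(is_good_practice, torch.zeros_like(conf), conf)
    return ((1.0 - mix) * ce + mix * kl).mean()
```

The dynamic distribution list mentioned in the abstract, which would refresh the supplementary targets as training progresses, is omitted from this sketch for brevity.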
Related papers
- EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training [79.96741042766524]
We reformulate the training curriculum as a soft-selection function.
We show that gradually exposing the contents of natural images can be readily achieved by adjusting the intensity of data augmentation.
The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective.
arXiv Detail & Related papers (2024-05-14T17:00:43Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Learning Representations for New Sound Classes With Continual Self-Supervised Learning [30.35061954854764]
We present a self-supervised learning framework for continually learning representations for new sound classes.
We show that representations learned with the proposed method generalize better and are less susceptible to catastrophic forgetting.
arXiv Detail & Related papers (2022-05-15T22:15:21Z)
- Training Dynamics for Text Summarization Models [45.62439188988816]
We analyze the training dynamics for generation models, focusing on news summarization.
Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, we study what the model learns at different stages of its fine-tuning process.
We find that properties such as copy behavior are learnt earlier in the training process and these observations are robust across domains.
On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior is more varied across domains.
arXiv Detail & Related papers (2021-10-15T21:13:41Z)
- Low-Regret Active learning [64.36270166907788]
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training.
At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on predictable (easy) instances.
arXiv Detail & Related papers (2021-04-06T22:53:45Z)
- Coded Machine Unlearning [34.08435990347253]
We present a coded learning protocol where the dataset is linearly coded before the learning phase.
We also present the corresponding unlearning protocol for the coded learning model along with a discussion on the proposed protocol's success in ensuring perfect unlearning.
arXiv Detail & Related papers (2020-12-31T17:20:34Z)
- Teaching with Commentaries [108.62722733649542]
We propose a flexible teaching framework using commentaries and learned meta-information.
We find that commentaries can improve training speed and/or performance.
Commentaries can be reused when training new models to obtain performance benefits.
arXiv Detail & Related papers (2020-11-05T18:52:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.