Boosting Commit Classification with Contrastive Learning
- URL: http://arxiv.org/abs/2308.08263v1
- Date: Wed, 16 Aug 2023 10:02:36 GMT
- Title: Boosting Commit Classification with Contrastive Learning
- Authors: Jiajun Tong, Zhixiao Wang and Xiaobin Rui
- Abstract summary: Commit Classification (CC) is an important task in software maintenance.
We propose a contrastive learning-based commit classification framework.
Our framework can solve the CC problem simply yet effectively in few-shot scenarios.
- Score: 0.8655526882770742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commit Classification (CC) is an important task in software maintenance,
which helps software developers classify code changes into different types
according to their nature and purpose. It allows developers to understand
better how their development efforts are progressing, identify areas where they
need improvement, and make informed decisions about when and how to release new
software versions. However, existing models require large amounts of manually
labeled data for fine-tuning and ignore sentence-level semantic information,
which is often essential for distinguishing diverse commits. Therefore,
solving CC in the few-shot scenario remains challenging.
To address these problems, we propose a contrastive learning-based commit
classification framework. First, we generate $K$ sentences and pseudo-labels
according to the labels of the dataset in order to augment it. Second, we
randomly group the augmented data $N$ times and compare each group's
similarity with the positive samples $T_p^{|C|}$ and the negative samples
$T_n^{|C|}$. We utilize individual pretrained sentence transformers (STs) to
efficiently obtain sentence-level embeddings for the different features.
Finally, we adopt the cosine similarity function to constrain the
distribution of the vectors so that similar vectors lie closer together. The
lightly fine-tuned model is then applied to predict the labels of incoming
commits.
Extensive experiments on two openly available datasets demonstrate that our
framework solves the CC problem simply yet effectively in few-shot scenarios,
achieving state-of-the-art (SOTA) performance and improving the model's
adaptability without requiring a large number of training samples
for fine-tuning. The code, data, and trained models are available at
https://github.com/AppleMax1992/CommitFit.
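To make the similarity-based prediction step concrete, here is a minimal Python sketch of the core idea: embed a commit message and one template sentence per class with a pretrained sentence transformer, then assign the label whose template is most cosine-similar. The checkpoint name, class labels, and template sentences below are illustrative assumptions; the paper's actual positive/negative templates $T_p^{|C|}$, $T_n^{|C|}$ and its contrastive fine-tuning procedure are not reproduced here.

```python
# Minimal sketch of similarity-based commit classification with a
# pretrained sentence transformer (ST). The checkpoint, labels, and
# templates are illustrative assumptions, not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

# Hypothetical positive template per class; the paper generates K such
# sentences with pseudo-labels to augment the training data.
CLASS_TEMPLATES = {
    "corrective": "This commit fixes a bug or fault in the code.",
    "adaptive": "This commit adds or adapts a feature of the software.",
    "perfective": "This commit refactors, cleans up, or documents the code.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def classify_commit(message: str) -> str:
    """Return the label whose template embedding is most cosine-similar."""
    labels = list(CLASS_TEMPLATES)
    template_emb = model.encode([CLASS_TEMPLATES[label] for label in labels])
    message_emb = model.encode([message])
    similarities = util.cos_sim(message_emb, template_emb)[0]  # shape (|C|,)
    return labels[int(similarities.argmax())]

print(classify_commit("fix null pointer dereference in the config parser"))
```

In the full framework, the sentence transformer would additionally be lightly fine-tuned with a contrastive objective so that commit embeddings move toward their positive templates and away from the negative ones; the sketch only shows the resulting inference step.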
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average across both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z) - Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z) - Incorprating Prompt tuning for Commit classification with prior Knowledge [0.76146285961466]
Commit Classification (CC) is an important task in software maintenance.
We propose a generative framework that incorporates prompt-tuning for commit classification with prior knowledge.
Our framework can solve the CC problem simply yet effectively in few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2023-08-21T09:17:43Z) - Distinguishability Calibration to In-Context Learning [31.375797763897104]
We propose a method to map a PLM-encoded embedding into a new metric space to guarantee the distinguishability of the resulting embeddings.
We also take advantage of hyperbolic embeddings to capture the hierarchical relations among fine-grained class-associated token embeddings.
arXiv Detail & Related papers (2023-02-13T09:15:00Z) - Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on image classification on all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z) - Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive amounts of accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach called Trustable Co-label Learning (TCL).
arXiv Detail & Related papers (2022-03-08T16:57:00Z) - Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity (a generic k-center sketch appears after this list).
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z) - Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in such scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z) - Imbalance Learning for Variable Star Classification [0.0]
We develop a hierarchical machine learning classification scheme to overcome imbalanced learning problems.
We use 'data-level' approaches to directly augment the training data so that they better describe under-represented classes.
We find that a higher classification rate is obtained when using GpFit in the hierarchical model.
arXiv Detail & Related papers (2020-02-27T19:01:05Z)
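Several entries above revolve around selecting informative samples from an unlabeled pool; Model-Aware K-center (MAK) in particular builds on greedy k-center (farthest-point) selection. The sketch below shows only that generic, classic algorithm, not MAK itself, which additionally scores candidates for tailness and proximity using the current model.

```python
# Generic greedy k-center (farthest-point) selection over embeddings.
# Illustrative sketch of the classic algorithm that MAK builds on; it
# implements only the diversity principle, not MAK's tailness/proximity.
import numpy as np

def greedy_k_center(pool: np.ndarray, seeds: np.ndarray, k: int) -> list:
    """Pick k pool indices, each farthest from all centers chosen so far."""
    # Distance from every pool point to its nearest seed center.
    dists = np.linalg.norm(pool[:, None, :] - seeds[None, :, :], axis=-1).min(axis=1)
    picked = []
    for _ in range(k):
        i = int(dists.argmax())  # farthest remaining point
        picked.append(i)
        # The newly picked point becomes a center; refresh nearest distances.
        dists = np.minimum(dists, np.linalg.norm(pool - pool[i], axis=1))
    return picked

# Toy usage: choose 10 diverse points from a pool of 100, given 5 seeds.
rng = np.random.default_rng(0)
print(greedy_k_center(rng.normal(size=(100, 8)), rng.normal(size=(5, 8)), k=10))
```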
This list is automatically generated from the titles and abstracts of the papers on this site.