Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively
Tuning Pre-trained Code Models
- URL: http://arxiv.org/abs/2401.01060v1
- Date: Tue, 2 Jan 2024 06:39:00 GMT
- Title: Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively
Tuning Pre-trained Code Models
- Authors: Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, Michael
R. Lyu
- Abstract summary: We propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets.
HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training.
The experimental results show that HINT can better leverage those unlabeled data in a task-specific way.
- Score: 38.7352992942213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained code models have recently achieved substantial improvements in
many code intelligence tasks. These models are first pre-trained on large-scale
unlabeled datasets in a task-agnostic manner using self-supervised learning,
and then fine-tuned on labeled datasets in downstream tasks. However, the
labeled datasets are usually limited in size (i.e., they require intensive
human effort to build), which may hinder the performance of pre-trained code
models on specific tasks.
To mitigate this, one possible solution is to leverage the large-scale
unlabeled data in the tuning stage by pseudo-labeling. However, directly
employing the pseudo-labeled data can bring a large amount of noise, i.e.,
incorrect labels, leading to suboptimal performance. How to effectively
leverage the noisy pseudo-labeled data is a challenging yet under-explored
problem. In this paper, we propose a novel approach named HINT to improve
pre-trained code models with large-scale unlabeled datasets by better utilizing
the pseudo-labeled data. HINT includes two main modules: HybrId pseudo-labeled
data selection and Noise-tolerant Training. In the hybrid pseudo-data selection
module, considering the robustness issue, apart from directly measuring the
quality of pseudo labels through training loss, we further propose to employ a
retrieval-based method to filter low-quality pseudo-labeled data. The
noise-tolerant training module aims to further mitigate the influence of errors
in pseudo labels by training the model with a noise-tolerant loss function and
by regularizing the consistency of model predictions. The experimental results
show that HINT can better leverage those unlabeled data in a task-specific way
and provide complementary benefits for pre-trained models, e.g., improving the
best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect
detection, and assertion generation, respectively.
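The two modules described in the abstract can be illustrated with a minimal training-step sketch: a noise-tolerant loss (symmetric cross-entropy is used here as one common choice; HINT's exact loss may differ) combined with a consistency term between two perturbed predictions. All names, constants, and the toy data below are illustrative assumptions, not HINT's actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sce_loss(probs, pseudo_onehot, alpha=0.1, beta=1.0, eps=1e-7):
    # Symmetric cross-entropy: standard CE plus a reverse-CE term that
    # bounds the penalty for confidently wrong pseudo labels, making
    # training more tolerant to label noise.
    ce = -(pseudo_onehot * np.log(probs + eps)).sum(axis=-1)
    rce = -(probs * np.log(np.clip(pseudo_onehot, eps, 1.0))).sum(axis=-1)
    return alpha * ce + beta * rce

def consistency_loss(p1, p2, eps=1e-7):
    # Symmetrized KL between two stochastic predictions (e.g., two
    # dropout-perturbed forward passes) regularizes prediction consistency.
    kl = lambda a, b: (a * (np.log(a + eps) - np.log(b + eps))).sum(axis=-1)
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

# Toy example: 3 pseudo-labeled samples, 4 classes, two perturbed passes.
logits = np.array([[2.0, 0.1, 0.0, -1.0],
                   [0.2, 1.5, 0.3, 0.0],
                   [-0.5, 0.0, 2.2, 0.1]])
perturbed = logits + np.array([0.1, -0.1, 0.05, 0.0])
pseudo = np.eye(4)[[0, 1, 2]]  # pseudo labels from the selection module

p1, p2 = softmax(logits), softmax(perturbed)
total_loss = (sce_loss(p1, pseudo) + consistency_loss(p1, p2)).mean()
```

In this sketch the selection module is assumed to have already filtered the pseudo-labeled batch; only the noise-tolerant training side is shown.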
Related papers
- Robust Data Pruning under Label Noise via Maximizing Re-labeling
Accuracy [34.02350195269502]
We formalize the problem of data pruning with re-labeling.
We propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples.
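A minimal sketch of the neighborhood-confidence idea (an illustration only, not the actual Prune4Rel algorithm): greedily keep the examples whose k-nearest-neighbor lists cover the most not-yet-covered confidence. Both `confidence` and the precomputed neighbor lists are assumed inputs.

```python
import numpy as np

def prune_by_neighborhood_confidence(confidence, neighbors, budget):
    # Greedily pick `budget` examples, each time taking the one whose
    # k-nearest-neighbor list adds the most not-yet-covered confidence.
    n = len(confidence)
    covered = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        gains = np.array([confidence[nb][~covered[nb]].sum()
                          for nb in neighbors])
        gains[selected] = -np.inf        # never pick the same example twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered[neighbors[best]] = True  # its neighborhood is now covered
    return sorted(selected)

# Toy run: 5 examples with per-example confidence scores and
# precomputed 2-nearest-neighbor index lists (both assumed inputs).
conf = np.array([0.9, 0.8, 0.2, 0.7, 0.1])
nbrs = [np.array([0, 1]), np.array([1, 0]), np.array([2, 4]),
        np.array([3, 1]), np.array([4, 2])]
kept = prune_by_neighborhood_confidence(conf, nbrs, budget=2)
```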
arXiv Detail & Related papers (2023-11-02T05:40:26Z)
- Boosting Semi-Supervised Learning by bridging high and low-confidence
predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL)
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and
Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- Learning with Noisy Labels by Adaptive Gradient-Based Outlier Removal [4.71154003227418]
We propose AGRA: a new method for learning with noisy labels by using Adaptive GRAdient-based outlier removal.
By comparing the aggregated gradient of a batch of samples and an individual example gradient, our method dynamically decides whether a corresponding example is helpful for the model.
Extensive evaluation on several datasets demonstrates AGRA's effectiveness.
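The gradient-comparison idea admits a short sketch. The concrete criterion used here (sign of the cosine similarity against the mean batch gradient) and all names are illustrative simplifications; AGRA's actual comparison batch and decision rule differ in detail.

```python
import numpy as np

def gradient_outlier_filter(example_grads, keep_threshold=0.0):
    # Compare each per-example gradient with the batch's aggregated
    # gradient; drop examples whose gradient points in a conflicting
    # direction, treating them as likely mislabeled.
    agg = example_grads.mean(axis=0)          # aggregated batch gradient
    agg_norm = np.linalg.norm(agg)
    helpful = []
    for i, g in enumerate(example_grads):
        denom = np.linalg.norm(g) * agg_norm
        cos = (g @ agg) / denom if denom > 0 else 0.0
        if cos > keep_threshold:              # roughly aligned: keep it
            helpful.append(i)
    return helpful

# Toy gradients: two aligned examples, one opposing (likely mislabeled).
grads = np.array([[1.0, 0.5], [0.9, 0.6], [-1.0, -0.4]])
helpful = gradient_outlier_filter(grads)
```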
arXiv Detail & Related papers (2023-06-07T15:10:01Z)
- Pseudo-Label Noise Suppression Techniques for Semi-Supervised Semantic
Segmentation [21.163070161951868]
Semi-supervised learning (SSL) can reduce the need for large labelled datasets by incorporating unsupervised data into the training.
Current SSL approaches use an initially supervised trained model to generate predictions for unlabelled images, called pseudo-labels.
We use three mechanisms to control pseudo-label noise and errors.
arXiv Detail & Related papers (2022-10-19T09:46:27Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, enjoys its adaptivity in terms of unlabeled data selection.
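The dynamic-thresholding idea can be sketched as follows: keep only the unlabeled examples whose pseudo-label loss falls under a threshold that decays as training progresses. The geometric decay schedule and all constants here are assumptions for illustration, not Dash's actual schedule.

```python
def dynamic_threshold_select(unlabeled_losses, step, rho0=2.0, gamma=1.1):
    # Keep unlabeled examples whose pseudo-label loss is below a
    # threshold that shrinks geometrically with the training step, so
    # the selection becomes stricter over time.
    threshold = rho0 * gamma ** (-step)
    return [i for i, loss in enumerate(unlabeled_losses) if loss < threshold]

losses = [0.3, 1.5, 0.05, 2.5]               # per-example pseudo-label losses
early = dynamic_threshold_select(losses, step=0)   # loose threshold early on
late = dynamic_threshold_select(losses, step=10)   # stricter threshold later
```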
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
- Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z)
- Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain
Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z)
- Improving Generalization of Deep Fault Detection Models in the Presence
of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers and a data augmentation technique.
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.