Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively
Tuning Pre-trained Code Models
- URL: http://arxiv.org/abs/2401.01060v1
- Date: Tue, 2 Jan 2024 06:39:00 GMT
- Title: Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively
Tuning Pre-trained Code Models
- Authors: Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, Michael
R. Lyu
- Abstract summary: We propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets.
HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training.
The experimental results show that HINT can better leverage those unlabeled data in a task-specific way.
- Score: 38.7352992942213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained code models have recently achieved substantial improvements in
many code intelligence tasks. These models are first pre-trained on large-scale
unlabeled datasets in a task-agnostic manner using self-supervised learning,
and then fine-tuned on labeled datasets in downstream tasks. However, the
labeled datasets are usually limited in size (i.e., they require intensive
human effort to build), which may hinder the performance of pre-trained code
models on specific tasks.
To mitigate this, one possible solution is to leverage the large-scale
unlabeled data in the tuning stage by pseudo-labeling. However, directly
employing the pseudo-labeled data can bring a large amount of noise, i.e.,
incorrect labels, leading to suboptimal performance. How to effectively
leverage the noisy pseudo-labeled data is a challenging yet under-explored
problem. In this paper, we propose a novel approach named HINT to improve
pre-trained code models with large-scale unlabeled datasets by better utilizing
the pseudo-labeled data. HINT includes two main modules: HybrId pseudo-labeled
data selection and Noise-tolerant Training. In the hybrid pseudo-data selection
module, considering the robustness issue, apart from directly measuring the
quality of pseudo labels through training loss, we further propose to employ a
retrieval-based method to filter low-quality pseudo-labeled data. The
noise-tolerant training module aims to further mitigate the influence of errors
in pseudo labels by training the model with a noise-tolerant loss function and
by regularizing the consistency of model predictions. The experimental results
show that HINT can better leverage those unlabeled data in a task-specific way
and provide complementary benefits for pre-trained models, e.g., improving the
best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect
detection, and assertion generation, respectively.
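The two modules described in the abstract can be illustrated with a minimal training-step sketch: a noise-tolerant loss (symmetric cross-entropy is used here as one common choice; HINT's exact loss may differ) combined with a consistency term between two perturbed predictions. All names, constants, and the toy data below are illustrative assumptions, not HINT's actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sce_loss(probs, pseudo_onehot, alpha=0.1, beta=1.0, eps=1e-7):
    # Symmetric cross-entropy: standard CE plus a reverse-CE term that
    # bounds the penalty for confidently wrong pseudo labels, making
    # training more tolerant to label noise.
    ce = -(pseudo_onehot * np.log(probs + eps)).sum(axis=-1)
    rce = -(probs * np.log(np.clip(pseudo_onehot, eps, 1.0))).sum(axis=-1)
    return alpha * ce + beta * rce

def consistency_loss(p1, p2, eps=1e-7):
    # Symmetrized KL between two stochastic predictions (e.g., two
    # dropout-perturbed forward passes) regularizes prediction consistency.
    kl = lambda a, b: (a * (np.log(a + eps) - np.log(b + eps))).sum(axis=-1)
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

# Toy example: 3 pseudo-labeled samples, 4 classes, two perturbed passes.
logits = np.array([[2.0, 0.1, 0.0, -1.0],
                   [0.2, 1.5, 0.3, 0.0],
                   [-0.5, 0.0, 2.2, 0.1]])
perturbed = logits + np.array([0.1, -0.1, 0.05, 0.0])
pseudo = np.eye(4)[[0, 1, 2]]  # pseudo labels from the selection module

p1, p2 = softmax(logits), softmax(perturbed)
total_loss = (sce_loss(p1, pseudo) + consistency_loss(p1, p2)).mean()
```

In this sketch the selection module is assumed to have already filtered the pseudo-labeled batch; only the noise-tolerant training side is shown.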
Related papers
- Robust Data Pruning under Label Noise via Maximizing Re-labeling
Accuracy [34.02350195269502]
We formalize the problem of data pruning with re-labeling.
We propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples.
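A minimal sketch of the neighborhood-confidence idea (an illustration only, not the actual Prune4Rel algorithm): greedily keep the examples whose k-nearest-neighbor lists cover the most not-yet-covered confidence. Both `confidence` and the precomputed neighbor lists are assumed inputs.

```python
import numpy as np

def prune_by_neighborhood_confidence(confidence, neighbors, budget):
    # Greedily pick `budget` examples, each time taking the one whose
    # k-nearest-neighbor list adds the most not-yet-covered confidence.
    n = len(confidence)
    covered = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        gains = np.array([confidence[nb][~covered[nb]].sum()
                          for nb in neighbors])
        gains[selected] = -np.inf        # never pick the same example twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered[neighbors[best]] = True  # its neighborhood is now covered
    return sorted(selected)

# Toy run: 5 examples with per-example confidence scores and
# precomputed 2-nearest-neighbor index lists (both assumed inputs).
conf = np.array([0.9, 0.8, 0.2, 0.7, 0.1])
nbrs = [np.array([0, 1]), np.array([1, 0]), np.array([2, 4]),
        np.array([3, 1]), np.array([4, 2])]
kept = prune_by_neighborhood_confidence(conf, nbrs, budget=2)
```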
arXiv Detail & Related papers (2023-11-02T05:40:26Z)
- Boosting Semi-Supervised Learning by bridging high and low-confidence
predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL)
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and
Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- Learning with Noisy Labels by Adaptive Gradient-Based Outlier Removal [4.71154003227418]
We propose AGRA: a new method for learning with noisy labels by using Adaptive GRAdient-based outlier removal.
By comparing the aggregated gradient of a batch of samples and an individual example gradient, our method dynamically decides whether a corresponding example is helpful for the model.
Extensive evaluation on several datasets demonstrates AGRA's effectiveness.
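The gradient-comparison idea admits a short sketch. The concrete criterion used here (sign of the cosine similarity against the mean batch gradient) and all names are illustrative simplifications; AGRA's actual comparison batch and decision rule differ in detail.

```python
import numpy as np

def gradient_outlier_filter(example_grads, keep_threshold=0.0):
    # Compare each per-example gradient with the batch's aggregated
    # gradient; drop examples whose gradient points in a conflicting
    # direction, treating them as likely mislabeled.
    agg = example_grads.mean(axis=0)          # aggregated batch gradient
    agg_norm = np.linalg.norm(agg)
    helpful = []
    for i, g in enumerate(example_grads):
        denom = np.linalg.norm(g) * agg_norm
        cos = (g @ agg) / denom if denom > 0 else 0.0
        if cos > keep_threshold:              # roughly aligned: keep it
            helpful.append(i)
    return helpful

# Toy gradients: two aligned examples, one opposing (likely mislabeled).
grads = np.array([[1.0, 0.5], [0.9, 0.6], [-1.0, -0.4]])
helpful = gradient_outlier_filter(grads)
```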
arXiv Detail & Related papers (2023-06-07T15:10:01Z)
- Pseudo-Label Noise Suppression Techniques for Semi-Supervised Semantic
Segmentation [21.163070161951868]
Semi-supervised learning (SSL) can reduce the need for large labelled datasets by incorporating unsupervised data into the training.
Current SSL approaches use an initially supervised trained model to generate predictions for unlabelled images, called pseudo-labels.
We use three mechanisms to control pseudo-label noise and errors.
arXiv Detail & Related papers (2022-10-19T09:46:27Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, enjoys its adaptivity in terms of unlabeled data selection.
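The dynamic-thresholding idea can be sketched as follows: keep only the unlabeled examples whose pseudo-label loss falls under a threshold that decays as training progresses. The geometric decay schedule and all constants here are assumptions for illustration, not Dash's actual schedule.

```python
def dynamic_threshold_select(unlabeled_losses, step, rho0=2.0, gamma=1.1):
    # Keep unlabeled examples whose pseudo-label loss is below a
    # threshold that shrinks geometrically with the training step, so
    # the selection becomes stricter over time.
    threshold = rho0 * gamma ** (-step)
    return [i for i, loss in enumerate(unlabeled_losses) if loss < threshold]

losses = [0.3, 1.5, 0.05, 2.5]               # per-example pseudo-label losses
early = dynamic_threshold_select(losses, step=0)   # loose threshold early on
late = dynamic_threshold_select(losses, step=10)   # stricter threshold later
```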
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
- Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z)
- Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain
Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z)
- Improving Generalization of Deep Fault Detection Models in the Presence
of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers and a data augmentation technique.
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.