Fugu-MT Paper Translation (Abstract): Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models

Paper overview: Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models

arxiv url: http://arxiv.org/abs/2401.01060v1
Date: Tue, 2 Jan 2024 06:39:00 GMT
Status: Translation complete
Last updated in system: 2024-01-03 14:33:16.746597
Title: Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models
Title (reference translation): Learning in the wild: leveraging unlabeled data to effectively tune pre-trained code models
Authors: Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, Michael R. Lyu
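Abstract summary: We propose a novel approach named HINT to improve pre-trained code models using large-scale unlabeled datasets. HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training. Experimental results show that HINT can effectively leverage unlabeled data in a task-specific way.
Reference score (independently computed attention score): 38.7352992942213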
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-trained code models have recently achieved substantial improvements in many code intelligence tasks. These models are first pre-trained on large-scale unlabeled datasets in a task-agnostic manner using self-supervised learning, and then fine-tuned on labeled datasets in downstream tasks. However, the labeled datasets are usually limited in size because labeling requires intensive human effort, which may hinder the performance of pre-trained code models in specific tasks. To mitigate this, one possible solution is to leverage the large-scale unlabeled data in the tuning stage by pseudo-labeling. However, directly employing the pseudo-labeled data can bring a large amount of noise, i.e., incorrect labels, leading to suboptimal performance. How to effectively leverage the noisy pseudo-labeled data is a challenging yet under-explored problem. In this paper, we propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets by better utilizing the pseudo-labeled data. HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training. In the hybrid pseudo-data selection module, considering the robustness issue, apart from directly measuring the quality of pseudo labels through training loss, we further propose to employ a retrieval-based method to filter low-quality pseudo-labeled data. The noise-tolerant training module aims to further mitigate the influence of errors in pseudo labels by training the model with a noise-tolerant loss function and by regularizing the consistency of model predictions. The experimental results show that HINT can better leverage unlabeled data in a task-specific way and provide complementary benefits for pre-trained models, e.g., improving the best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect detection, and assertion generation, respectively.
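Abstract (reference translation): Pre-trained code models have recently brought substantial improvements to many code intelligence tasks. These models are first pre-trained on large-scale unlabeled datasets in a task-agnostic manner using self-supervised learning, and are then fine-tuned on labeled datasets for downstream tasks. However, labeled datasets are usually limited in size (that is, they require intensive human effort), which can hinder the performance of pre-trained code models on specific tasks. One possible way to mitigate this is to leverage large-scale unlabeled data in the tuning stage through pseudo-labeling. However, directly using pseudo-labeled data introduces a large amount of noise, i.e., incorrect labels, which leads to suboptimal performance. In this paper, we propose a novel approach that improves pre-trained code models with large-scale unlabeled datasets by making use of pseudo-labeled data. HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training. In the hybrid pseudo-data selection module, apart from directly measuring the quality of pseudo labels through training loss, we propose, with robustness in mind, a retrieval-based method to filter low-quality pseudo-labeled data. The noise-tolerant training module aims to further mitigate the influence of errors in pseudo labels by training the model with a noise-tolerant loss function and by regularizing the consistency of model predictions. The experimental results show that HINT can better leverage unlabeled data in a task-specific way and provide complementary benefits for pre-trained models, e.g., improving the best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect detection, and assertion generation, respectively.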
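
The two modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the selection criteria or the loss, so everything below is an assumption made for illustration. The loss-based filter keeps the lowest-loss fraction of candidates; the retrieval-based filter is a hypothetical stand-in that keeps a candidate only if its input is similar enough (token-set Jaccard) to some labeled input; generalized cross-entropy is swapped in as one well-known noise-tolerant loss; and a symmetric KL divergence stands in for the consistency regularizer. All function names, parameters (keep_ratio, min_sim, q), and the toy data are hypothetical.

```python
import math

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def select_pseudo_labeled(candidates, labeled_inputs, keep_ratio=0.5, min_sim=0.3):
    """Hybrid selection sketch (assumed criteria, not the paper's exact rules).

    candidates: list of dicts with keys 'input', 'pseudo_label', 'loss'.
    Step 1 (loss-based): keep the keep_ratio fraction with the lowest loss.
    Step 2 (retrieval-based): keep a candidate only if its most similar
    labeled input has Jaccard similarity >= min_sim.
    """
    n_keep = max(1, int(len(candidates) * keep_ratio))
    by_loss = sorted(candidates, key=lambda c: c["loss"])[:n_keep]
    selected = []
    for cand in by_loss:
        nearest = max((jaccard(cand["input"], x) for x in labeled_inputs), default=0.0)
        if nearest >= min_sim:
            selected.append(cand)
    return selected

def gce_loss(p_label: float, q: float = 0.7) -> float:
    """Generalized cross-entropy (1 - p^q) / q, a stand-in noise-tolerant loss.

    p_label is the model's probability for the (pseudo) label.
    """
    return (1.0 - p_label ** q) / q

def symmetric_kl(p, r, eps=1e-12):
    """Symmetric KL divergence between two predicted distributions,
    used here as a stand-in consistency penalty between two forward passes."""
    kl_pr = sum(pi * math.log((pi + eps) / (ri + eps)) for pi, ri in zip(p, r))
    kl_rp = sum(ri * math.log((ri + eps) / (pi + eps)) for pi, ri in zip(p, r))
    return 0.5 * (kl_pr + kl_rp)

# Toy data (hypothetical): two labeled inputs and four pseudo-labeled candidates.
labeled = [
    "def add ( a , b ) : return a + b",
    "def sub ( a , b ) : return a - b",
]
candidates = [
    {"input": "def mul ( a , b ) : return a * b", "pseudo_label": "multiply two numbers", "loss": 0.2},
    {"input": "SELECT * FROM users", "pseudo_label": "add two numbers", "loss": 0.1},
    {"input": "def div ( a , b ) : return a / b", "pseudo_label": "subtract", "loss": 2.5},
    {"input": "def neg ( a ) : return - a", "pseudo_label": "negate", "loss": 0.4},
]

selected = select_pseudo_labeled(candidates, labeled, keep_ratio=0.75, min_sim=0.3)
for c in selected:
    print(c["pseudo_label"], round(gce_loss(math.exp(-c["loss"])), 3))
```

In this toy run the high-loss candidate is dropped by the loss filter, and the SQL snippet is dropped by the retrieval filter because it shares no tokens with any labeled input. Using both filters reflects the abstract's point that training loss alone may not be a robust quality signal.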

Related papers

Early Stopping Against Label Noise Without Validation Data [54.27621957395026]
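We propose a novel early stopping method called Label Wave, which does not require validation data to select the desired model. We demonstrate the effectiveness of the Label Wave method across various settings and its ability to improve the performance of existing methods for learning with noisy labels.
Paper reference translation (metadata) (2025-02-11T13:40:15Z)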
Learning from Noisy Labels via Self-Taught On-the-Fly Meta Loss Rescaling [6.861041888341339]
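We propose unsupervised on-the-fly meta loss rescaling to reweight training samples. We are among the first to attempt training-data reweighting on the challenging task of dialogue modeling. Our strategy is robust in the face of both noisy and clean data, handles class imbalance, and prevents overfitting to noisy labels.
Paper reference translation (metadata) (2024-12-17T14:37:50Z)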
Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy [34.02350195269502]
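We formulate the problem of data pruning with re-labeling. We then propose Prune4Rel, a novel data pruning algorithm that maximizes the neighborhood confidence of all training examples.
Paper reference translation (metadata) (2023-11-02T05:40:26Z)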
Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
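Pseudo-labeling is a crucial technique in semi-supervised learning (SSL). We propose a new method called ReFixMatch, which aims to make use of all of the unlabeled data during training.
Paper reference translation (metadata) (2023-08-15T00:27:18Z)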
Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
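We propose a conditional image generation framework that accepts noisy-labeled and uncurated unlabeled data during training. We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels to unlabeled data. Experiments show that our approach outperforms existing semi-supervised and label-noise-robust methods in both quantitative and qualitative performance.
Paper reference translation (metadata) (2023-07-17T08:31:59Z)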
Learning with Noisy Labels by Adaptive Gradient-Based Outlier Removal [4.71154003227418]
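We propose AGRA, a new method for learning with noisy labels using Adaptive GRAdient-based outlier removal. The method compares the aggregated gradient of a set of samples with the gradient of an individual sample to decide dynamically whether that sample is helpful for the model. Extensive evaluation on several datasets demonstrates the effectiveness of AGRA.
Paper reference translation (metadata) (2023-06-07T15:10:01Z)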
Pseudo-Label Noise Suppression Techniques for Semi-Supervised Semantic Segmentation [21.163070161951868]
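Semi-supervised learning (SSL) can reduce the need for large labeled datasets by incorporating unlabeled data into training. Current SSL approaches use an initially supervised-trained model to generate predictions for unlabeled images, called pseudo-labels. Pseudo-label noise and errors are controlled by three mechanisms.
Paper reference translation (metadata) (2022-10-19T09:46:27Z)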
Debiased Pseudo Labeling in Self-Training [77.83549261035277]
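Deep neural networks achieve remarkable performance on a wide range of tasks with the help of large-scale labeled datasets. To reduce the need for labeled data, self-training, which assigns pseudo labels to unlabeled data, is widely used in both academia and industry. We propose a debiased method in which the generation and the utilization of pseudo labels are decoupled by two independent heads.
Paper reference translation (metadata) (2022-02-15T02:14:33Z)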
Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
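We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models. The proposed approach, Dash, is adaptive in how it selects unlabeled data.
Paper reference translation (metadata) (2021-09-01T23:52:29Z)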
Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
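Self-Tuning is a novel approach that enables data-efficient deep learning. It unifies the exploration of labeled and unlabeled data with the transfer of pre-trained models. It outperforms its SSL and TL counterparts on five tasks by sharp margins.
Paper reference translation (metadata) (2021-02-25T14:56:19Z)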
Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain Adaptation [87.60688582088194]
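We propose a novel self-supervised noisy label learning method. The method readily achieves state-of-the-art results and surpasses other methods by a very large margin.
Paper reference translation (metadata) (2021-02-23T10:51:45Z)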
Improving Generalization of Deep Fault Detection Models in the Presence of Mislabeled Data [1.3535770763481902]
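We propose a novel two-step framework for robust training with label noise. In the first step, outliers (including mislabeled samples) are identified based on updates to the hypothesis space. In the second step, we propose different approaches for modifying the training data based on the identified outliers and a data augmentation technique.
Paper reference translation (metadata) (2020-09-30T12:33:25Z)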

The related papers list is generated automatically from the titles and abstracts of papers on this site.
