Dual Debiasing for Noisy In-Context Learning for Text Generation
- URL: http://arxiv.org/abs/2506.00418v2
- Date: Sat, 21 Jun 2025 07:56:37 GMT
- Title: Dual Debiasing for Noisy In-Context Learning for Text Generation
- Authors: Siqi Liang, Sumyeong Ahn, Paramveer S. Dhillon, Jiayu Zhou,
- Abstract summary: In context learning relies heavily on high quality demonstrations drawn from large annotated corpora.<n>Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts.<n>We reexamine the perplexity based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity.<n>We introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score.
- Score: 35.57978360647332
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In context learning (ICL) relies heavily on high quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method's superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.
Related papers
- Robust Noisy Correspondence Learning via Self-Drop and Dual-Weight [11.523154025649758]
Crowd-sourcing or web crawling introduces mismatched pairs.<n>Current approaches leverage the effect of deep neural networks to distinguish noise and perform re-weighting.<n>We propose a novel self-drop and dual-weight approach, which achieves elaborate data processing by qua- Partitioning the data.
arXiv Detail & Related papers (2024-12-09T03:06:10Z) - On the Noise Robustness of In-Context Learning for Text Generation [41.59602454113563]
In this work, we show that, on text generation tasks, noisy annotations significantly hurt the performance of in-context learning.
To circumvent the issue, we propose a simple and effective approach called Local Perplexity Ranking (LPR)
LPR replaces the "noisy" candidates with their nearest neighbors that are more likely to be clean.
arXiv Detail & Related papers (2024-05-27T15:22:58Z) - ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z) - Learning from Noisy Labels with Coarse-to-Fine Sample Credibility
Modeling [22.62790706276081]
Training deep neural network (DNN) with noisy labels is practically challenging.
Previous efforts tend to handle part or full data in a unified denoising flow.
We propose a coarse-to-fine robust learning method called CREMA to handle noisy data in a divide-and-conquer manner.
arXiv Detail & Related papers (2022-08-23T02:06:38Z) - Label-Noise Learning with Intrinsically Long-Tailed Data [65.41318436799993]
We propose a learning framework for label-noise learning with intrinsically long-tailed data.
Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples.
arXiv Detail & Related papers (2022-08-21T07:47:05Z) - Neighborhood Collective Estimation for Noisy Label Identification and
Correction [92.20697827784426]
Learning with noisy labels (LNL) aims at designing strategies to improve model performance and generalization by mitigating the effects of model overfitting to noisy labels.
Recent advances employ the predicted label distributions of individual samples to perform noise verification and noisy label correction, easily giving rise to confirmation bias.
We propose Neighborhood Collective Estimation, in which the predictive reliability of a candidate sample is re-estimated by contrasting it against its feature-space nearest neighbors.
arXiv Detail & Related papers (2022-08-05T14:47:22Z) - Centrality and Consistency: Two-Stage Clean Samples Identification for
Learning with Instance-Dependent Noisy Labels [87.48541631675889]
We propose a two-stage clean samples identification method.
First, we employ a class-level feature clustering procedure for the early identification of clean samples.
Second, for the remaining clean samples that are close to the ground truth class boundary, we propose a novel consistency-based classification method.
arXiv Detail & Related papers (2022-07-29T04:54:57Z) - Open-set Label Noise Can Improve Robustness Against Inherent Label Noise [27.885927200376386]
We show that open-set noisy labels can be non-toxic and even benefit the robustness against inherent noisy labels.
We propose a simple yet effective regularization by introducing Open-set samples with Dynamic Noisy Labels (ODNL) into training.
arXiv Detail & Related papers (2021-06-21T07:15:50Z) - Training Classifiers that are Universally Robust to All Label Noise
Levels [91.13870793906968]
Deep neural networks are prone to overfitting in the presence of label noise.
We propose a distillation-based framework that incorporates a new subcategory of Positive-Unlabeled learning.
Our framework generally outperforms at medium to high noise levels.
arXiv Detail & Related papers (2021-05-27T13:49:31Z) - Bridging the Gap Between Clean Data Training and Real-World Inference
for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a textitgap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedding into similar vector space.
Experiments on the widely-used dataset, Snips, and large scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on real-world (noisy) corpus but also enhances the robustness, that is, it produces high-quality results under a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.