Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought
- URL: http://arxiv.org/abs/2402.04004v2
- Date: Fri, 9 Feb 2024 01:56:38 GMT
- Title: Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought
- Authors: Alex Havrilla, Maia Iyer
- Abstract summary: We study how noise in chain of thought impacts task performance in a highly controlled setting.
We define two types of noise: \textit{static} noise, a local form of noise applied after the CoT trace is computed, and \textit{dynamic} noise, a global form of noise that propagates errors through the trace as it is computed.
We find fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During both pretraining and fine-tuning, Large Language Models
(\textbf{LLMs}) are trained on trillions of tokens of text of widely varying
quality. Both phases of training typically involve heuristically filtering out
``low-quality'' or \textit{noisy} training samples, yet little is known
quantitatively about how the type or intensity of noise affects downstream
performance. In this work, we study how noise in chain of thought
(\textbf{CoT}) impacts task performance in the highly-controlled setting of
algorithmically solvable tasks. First, we develop the Traced Integer
(\textbf{TInt}) framework to generate highly customizable noised execution
traces for any arithmetic function on lists of integers. We then define two
types of noise: \textit{static} noise, a local form of noise which is applied
after the CoT trace is computed, and \textit{dynamic} noise, a global form of
noise which propagates errors in the trace as it is computed. We then evaluate
the test performance of pretrained models both prompted and fine-tuned on
noised datasets with varying levels of dataset contamination and intensity. We
find fine-tuned models are extremely robust to high levels of static noise but
struggle significantly more with lower levels of dynamic noise. In contrast,
few-shot prompted models appear more sensitive to even static noise. We
conclude with a discussion of how our findings impact noise filtering
best-practices, in particular emphasizing the importance of removing samples
containing destructive dynamic noise with global errors.
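The paper's TInt framework itself is not reproduced here, but the static/dynamic distinction can be illustrated on a running-sum trace. The following is a minimal sketch (all function names and the ±1 perturbation model are hypothetical, not from the paper): static noise corrupts individual steps after the trace exists, while dynamic noise corrupts the running value so the error compounds into every later step.

```python
import random

def trace_sum(xs):
    """Emit a step-by-step CoT trace (running totals) for summing integers."""
    steps, total = [], 0
    for x in xs:
        total += x
        steps.append(total)
    return steps

def static_noise(steps, p=0.5):
    """Local noise: perturb individual steps *after* the trace is computed.
    An error in one step does not affect any other step."""
    return [s + random.choice([-1, 1]) if random.random() < p else s
            for s in steps]

def dynamic_noise(xs, p=0.1):
    """Global noise: perturb the running value *as* the trace is computed,
    so a single error propagates to every subsequent step."""
    steps, total = [], 0
    for x in xs:
        total += x
        if random.random() < p:
            total += random.choice([-1, 1])
        steps.append(total)
    return steps
```

Under this toy model, a statically noised trace can still end on the correct answer if the final step happens to be untouched, whereas one dynamic error early in the trace guarantees every later step (including the answer) is wrong, which matches the paper's finding that dynamic noise is the more destructive kind.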
Related papers
- NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification [7.464154519547575]
  Existing research on learning with noisy labels predominantly focuses on synthetic noise patterns.
  We constructed a benchmark dataset to better understand label noise in real-world text classification settings.
  Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise.
  arXiv Detail & Related papers (2024-07-09T06:18:40Z)
- NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition [3.726602636064681]
  We present an analysis showing that real noise is significantly more challenging than simulated noise.
  We show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound.
  arXiv Detail & Related papers (2024-05-13T10:20:31Z)
- Learning with Noisy Foundation Models [95.50968225050012]
  This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
  We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
  arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Noisy Pair Corrector for Dense Retrieval [59.312376423104055]
  We propose a novel approach called Noisy Pair Corrector (NPC).
  NPC consists of a detection module and a correction module.
  We conduct experiments on the text-retrieval benchmarks Natural Questions and TriviaQA, and the code-search benchmarks StaQC and SO-DS.
  arXiv Detail & Related papers (2023-11-07T08:27:14Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
  This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
  We propose a lightweight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
  arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Improving the Robustness of Summarization Models by Detecting and Removing Input Noise [50.27105057899601]
  We present a large empirical study quantifying the sometimes severe loss in performance from different types of input noise for a range of datasets and model sizes.
  We propose a lightweight method for detecting and removing such noise in the input during model inference without requiring any training, auxiliary models, or even prior knowledge of the type of noise.
  arXiv Detail & Related papers (2022-12-20T00:33:11Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
  We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
  Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
  arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Label noise detection under the Noise at Random model with ensemble filters [5.994719700262245]
  This work investigates the performance of ensemble noise detection under two different noise models.
  We investigate the effect of class distribution on noise detection performance, since it changes the total noise level observed in a dataset.
  arXiv Detail & Related papers (2021-12-02T21:49:41Z)
- Training Classifiers that are Universally Robust to All Label Noise Levels [91.13870793906968]
  Deep neural networks are prone to overfitting in the presence of label noise.
  We propose a distillation-based framework that incorporates a new subcategory of Positive-Unlabeled learning.
  Our framework generally outperforms at medium to high noise levels.
  arXiv Detail & Related papers (2021-05-27T13:49:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.