When Bad Data Leads to Good Models
- URL: http://arxiv.org/abs/2505.04741v1
- Date: Wed, 07 May 2025 19:17:49 GMT
- Title: When Bad Data Leads to Good Models
- Authors: Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
- Abstract summary: In large language model (LLM) pretraining, data quality is believed to determine model quality. We re-examine the notion of "quality" from the perspective of pre- and post-training co-design.
- Score: 44.897123018926486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
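The detoxification step referenced in the abstract relies on toxicity having a (nearly) linear representation in activation space, which is what makes steering-style interventions such as ITI effective. The snippet below is a minimal sketch of that idea, not the paper's exact procedure: it estimates a toxicity direction with a difference-of-means probe and shifts one layer's hidden states away from it during generation (ITI proper intervenes on selected attention heads with probe-derived directions). The model name, layer index, steering strength, and example texts are placeholders.

```python
# Minimal sketch of steering-style inference-time intervention (ITI):
# estimate a linear "toxicity direction" from hidden states, then shift
# activations away from it during generation. All constants below are
# illustrative assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "allenai/OLMo-1B-hf"   # assumption: an OLMo/Llama-style causal LM
LAYER_IDX = 12                      # assumption: one mid-network layer
ALPHA = 8.0                         # steering strength (placeholder value)

# Placeholder texts; in practice these come from a toxicity-labeled corpus.
TOXIC_EXAMPLES = ["<toxic example 1>", "<toxic example 2>"]
CLEAN_EXAMPLES = ["<clean example 1>", "<clean example 2>"]

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def mean_hidden(texts, layer_idx):
    """Average hidden state at one layer over a set of example texts."""
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer_idx][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Difference-of-means estimate of the linear toxicity direction.
toxic_dir = mean_hidden(TOXIC_EXAMPLES, LAYER_IDX) - mean_hidden(CLEAN_EXAMPLES, LAYER_IDX)
toxic_dir = toxic_dir / toxic_dir.norm()

def steer_hook(module, inputs, output):
    """Shift every token's hidden state away from the toxicity direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * toxic_dir
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[LAYER_IDX].register_forward_hook(steer_hook)
prompt = tok("The protesters were", return_tensors="pt")
out_ids = model.generate(**prompt, max_new_tokens=40, do_sample=False)
print(tok.decode(out_ids[0], skip_special_tokens=True))
handle.remove()
```

Per the abstract, the less entangled this direction is, the better the trade-off such an intervention achieves between reducing generational toxicity and preserving general capabilities.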
Related papers
- Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA [0.0]
This work removes 7,531 toxic image-text pairs from the LLaVA pre-training dataset.
We offer guidelines for implementing robust toxicity detection pipelines.
arXiv Detail & Related papers (2025-05-09T18:01:50Z) - PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning [32.508939142492004]
We introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning.
Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases.
We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models.
arXiv Detail & Related papers (2024-10-11T13:50:50Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples in a mixed dataset is both valuable and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity [6.786565820048478]
We introduce a tuning-free alignment alternative, ProFS, and demonstrate its effectiveness in the use case of toxicity reduction.
ProFS identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected subspace.
We show that ProFS is more sample-efficient than DPO and more robust to noisy data.
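For context, here is a generic sketch of the "identify a toxic subspace and project it away" idea described above; it is not the ProFS algorithm itself, and how the subspace is estimated and which weights are edited are illustrative assumptions.

```python
# Generic sketch: estimate a low-rank "toxic subspace" from paired
# embeddings and remove that subspace from a weight matrix.
# Every modeling choice here is an illustrative assumption.
import torch

def toxic_subspace(toxic_emb: torch.Tensor,
                   clean_emb: torch.Tensor,
                   rank: int = 2) -> torch.Tensor:
    """Orthonormal basis (d x rank) of the dominant directions of the
    paired toxic-minus-clean embedding differences."""
    diffs = toxic_emb - clean_emb          # (n, d) paired differences
    diffs = diffs - diffs.mean(dim=0)      # drop the shared mean component
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:rank].T                     # top right-singular vectors

def project_away(weight: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove each row's component inside the subspace spanned by `basis`.
    `weight` is (out_dim, d); its rows live in the same d-dimensional
    space as the embeddings used to estimate the subspace."""
    projector = basis @ basis.T            # (d, d) projector onto the subspace
    return weight - weight @ projector

# Toy usage with random stand-ins for real embeddings and a weight matrix.
n, d = 32, 64
basis = toxic_subspace(torch.randn(n, d), torch.randn(n, d), rank=2)
edited_w = project_away(torch.randn(128, d), basis)
```

The appeal noted in the summary is that this style of edit is tuning-free: it modifies weights directly instead of running a preference-optimization loop.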
arXiv Detail & Related papers (2024-05-22T20:08:48Z) - Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models [11.805944680474823]
Goodtriever is a flexible methodology that matches the current state of the art in toxicity mitigation.
By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation.
arXiv Detail & Related papers (2023-10-11T15:30:35Z) - On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
Attacks that corrupt deep learning models through malicious training samples are known as data poisoning.
Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness.
Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness.
arXiv Detail & Related papers (2023-06-28T17:59:35Z) - A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study presents the largest set of experiments to date for validating, quantifying, and exposing undocumented intuitions about text pretraining.
Our findings indicate that there is no one-size-fits-all solution for filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z) - Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations [15.152559543181523]
This study is the first to investigate the effect of adversarial behavior and data augmentation on cyberbullying detection.
We demonstrate that model-agnostic lexical substitutions significantly hurt performance.
Augmentations proposed in prior work on toxicity prove to be less effective.
arXiv Detail & Related papers (2022-01-17T12:48:27Z) - ToxCCIn: Toxic Content Classification with Interpretability [16.153683223016973]
Explanations are important for tasks like offensive language or toxicity detection on social media.
We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption.
We find this approach effective; it can produce explanations that exceed the quality of those provided by logistic regression analysis.
arXiv Detail & Related papers (2021-03-01T22:17:10Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.