How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs
- URL: http://arxiv.org/abs/2509.19325v1
- Date: Sat, 13 Sep 2025 18:55:52 GMT
- Title: How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs
- Authors: Jian Ouyang, Arman T, Ge Jin
- Abstract summary: This paper investigates the impact of incorrect data on the performance and safety of large language models (LLMs). We evaluate models fine-tuned with varying ratios of both obviously and subtly incorrect data across four domains: coding, finance, health, and legal. A clear threshold of at least 50% correct data is needed for models to consistently recover strong performance.
- Score: 2.4794014826920363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the impact of incorrect data on the performance and safety of large language models (LLMs), specifically gpt-4o, during supervised fine-tuning (SFT). Although LLMs are becoming increasingly vital across broad domains like finance, coding, law, and health, fine-tuning on incorrect data can lead to "emergent misalignment," producing harmful or deceptive outputs unrelated to the intended task. We evaluate gpt-4o models fine-tuned with varying ratios (10% to 90% correct) of both obviously and subtly incorrect data across four domains: coding, finance, health, and legal. Our findings show that even modest amounts of incorrect data (10-25%) dramatically degrade domain performance, but not moral alignment. A clear threshold of at least 50% correct data is needed for models to consistently recover strong performance, though they rarely match the robustness and safety of the base model, which exhibits near-perfect alignment and zero dangerous completions out-of-the-box. This research emphasizes that the cost of incorrect data is heavy, highlighting the critical need for extremely high-quality data curation or, alternatively, leveraging robust base models without unnecessary fine-tuning for high-stakes applications.
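The ratio-based setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `mix_dataset`, the example record layout, and the ratio grid are all hypothetical, chosen only to show how a fine-tuning set with a fixed share of correct examples might be assembled.

```python
import random


def mix_dataset(correct, incorrect, correct_ratio, size, seed=0):
    """Build a shuffled dataset of `size` examples in which a
    `correct_ratio` fraction is sampled from `correct` and the
    remainder from `incorrect`. (Hypothetical helper, for illustration.)"""
    if not 0.0 <= correct_ratio <= 1.0:
        raise ValueError("correct_ratio must be between 0 and 1")
    n_correct = round(size * correct_ratio)
    n_incorrect = size - n_correct
    if n_correct > len(correct) or n_incorrect > len(incorrect):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(correct, n_correct) + rng.sample(incorrect, n_incorrect)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy pools; real pools would hold prompt/response pairs per domain.
    correct_pool = [{"id": i, "label": "correct"} for i in range(100)]
    incorrect_pool = [{"id": i, "label": "incorrect"} for i in range(100)]

    # Assumed grid inside the 10%-90% range; the paper's exact steps may differ.
    for ratio in (0.10, 0.25, 0.50, 0.75, 0.90):
        mixed = mix_dataset(correct_pool, incorrect_pool, ratio, size=40)
        n_ok = sum(1 for ex in mixed if ex["label"] == "correct")
        print(f"target {ratio:.0%} correct: {n_ok}/{len(mixed)} correct examples")
```

Sampling without replacement and shuffling with a fixed seed keeps each mix reproducible, so differences between fine-tuned models can be attributed to the ratio rather than to the particular draw.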
Related papers
- Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data [10.698357983420928]
This work aims to improve the robustness of Large Language Models against potential adversarial inputs. We systematically evaluated robustness by fine-tuning models using datasets perturbed at character-level, word-level, and sentence-level. Fine-tuning models with perturbed datasets significantly improves model robustness (RD usually drops around 4% - 6%), especially for models with relatively weak robustness.
arXiv Detail & Related papers (2026-02-11T22:30:01Z)
- Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM [51.21051698747157]
We propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of large language models (LLMs). Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model's learning process. Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness.
arXiv Detail & Related papers (2025-11-07T08:34:50Z)
- Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework. It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
- PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare [2.1046377530356764]
Fine-tuning of Large Language Models (LLMs) has become the default practice for improving model performance on a given task. PatientDx is a framework of model merging that allows the design of effective LLMs for health-predictive tasks without requiring fine-tuning or adaptation on patient data.
arXiv Detail & Related papers (2025-04-24T08:21:04Z)
- DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning [22.704995231753397]
Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation. We propose DONOD, a lightweight model-intrinsic data pruning method. By filtering out 70% of the whole dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%.
arXiv Detail & Related papers (2025-04-21T02:25:03Z)
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
- Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning [92.16191092329765]
We investigate the impact of corrupted data on Multimodal Large Language Models (MLLMs). We find that, although corrupted data degrade model performance, such adverse effects are largely reversible. We introduce a corruption-robust training paradigm that significantly surpasses existing strategies for mitigating the effects of corrupted data.
arXiv Detail & Related papers (2025-02-18T08:28:29Z)
- Leveraging Web-Crawled Data for High-Quality Fine-Tuning [24.19939701706869]
We argue that web-crawled data can still serve as a valuable source for high-quality supervised fine-tuning without relying on advanced models like GPT-4.
We create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data.
Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% in Chinese math problems.
arXiv Detail & Related papers (2024-08-15T08:12:52Z)
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- Parameter-tuning-free data entry error unlearning with adaptive selective synaptic dampening [51.34904967046097]
We introduce an extension to the selective synaptic dampening unlearning method that removes the need for parameter tuning.
We demonstrate the performance of this extension, adaptive selective synaptic dampening (ASSD), on various ResNet18 and Vision Transformer unlearning tasks.
The application of this approach is particularly compelling in industrial settings, such as supply chain management.
arXiv Detail & Related papers (2024-02-06T14:04:31Z)
- SciFix: Outperforming GPT3 on Scientific Factual Error Correction [9.850216012914684]
SciFix is a scientific claim correction system that does not require a verifier but can outperform existing methods by a considerable margin.
Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset.
arXiv Detail & Related papers (2023-05-24T04:24:16Z)
- Unsupervised Robust Domain Adaptation without Source Data [75.85602424699447]
We study the problem of robust domain adaptation in the context of unavailable target labels and source data.
We show a consistent performance improvement of over 10% in accuracy against the tested baselines on four benchmark datasets.
arXiv Detail & Related papers (2021-03-26T16:42:28Z)