Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning
- URL: http://arxiv.org/abs/2502.12635v1
- Date: Tue, 18 Feb 2025 08:28:29 GMT
- Title: Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning
- Authors: Yunhao Gou, Hansi Yang, Zhili Liu, Kai Chen, Yihan Zeng, Lanqing Hong, Zhenguo Li, Qun Liu, James T. Kwok, Yu Zhang,
- Abstract summary: We study how corrupted data affects Multimodal Large Language Models (MLLMs)
We find that while corrupted data degrades the performance of MLLMs, its effects are largely superficial.
We propose a corruption-robust training paradigm combining self-validation and post-training, which significantly outperforms existing corruption mitigation strategies.
- Score: 85.58172296577506
- License:
- Abstract: Visual Instruction Tuning (VIT) enhances Multimodal Large Language Models (MLLMs) but it is hindered by corrupted datasets containing hallucinated content, incorrect responses, and poor OCR quality. While prior works focus on dataset refinement through high-quality data collection or rule-based filtering, they are costly or limited to specific types of corruption. To deeply understand how corrupted data affects MLLMs, in this paper, we systematically investigate this issue and find that while corrupted data degrades the performance of MLLMs, its effects are largely superficial in that the performance of MLLMs can be largely restored by either disabling a small subset of parameters or post-training with a small amount of clean data. Additionally, corrupted MLLMs exhibit improved ability to distinguish clean samples from corrupted ones, enabling the dataset cleaning without external help. Based on those insights, we propose a corruption-robust training paradigm combining self-validation and post-training, which significantly outperforms existing corruption mitigation strategies.
Related papers
- Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies [8.770864706004472]
Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning.
This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects.
We find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption.
arXiv Detail & Related papers (2024-12-24T09:04:06Z) - The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection [23.378592856800168]
Large Language Models (LLMs) can be used to automate the annotation process.
This study investigates whether LLMs are viable for annotating the complex task of media bias detection.
We create annolexical, the first large-scale dataset for media bias classification.
arXiv Detail & Related papers (2024-11-17T14:14:36Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG [50.030526904378256]
Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents.
To address this issue, we explore the task of "credibility-aware RAG"
We introduce a plug-and-play method named $textbfCr$edibility-aware $textbfA$ttention $textbfM$odification (CrAM)
Experiments on Natual Questions and TriviaQA using Llama2-13B, Llama3-8B, and Qwen1.5-7B show that CrAM improves
arXiv Detail & Related papers (2024-06-17T13:01:12Z) - SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning [16.307467144690683]
Large Language Models can achieve desirable performance with only a small amount of high-quality data.
Identifying high-quality data from vast datasets to curate small yet effective datasets has emerged as a critical challenge.
We introduce SHED, an automated dataset refinement framework based on Shapley value for instruction fine-tuning.
arXiv Detail & Related papers (2024-04-23T04:56:48Z) - LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z) - Purifying Large Language Models by Ensembling a Small Language Model [39.57304668057076]
We propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data.
We empirically confirm the efficacy of ensembling LLMs with benign and small language models (SLMs)
arXiv Detail & Related papers (2024-02-19T14:00:39Z) - Mitigating Object Hallucination in Large Vision-Language Models via
Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE)
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
arXiv Detail & Related papers (2024-02-13T18:59:05Z) - Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.