Related papers: Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning

URL: http://arxiv.org/abs/2502.12635v1
Date: Tue, 18 Feb 2025 08:28:29 GMT
Title: Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning
Authors: Yunhao Gou, Hansi Yang, Zhili Liu, Kai Chen, Yihan Zeng, Lanqing Hong, Zhenguo Li, Qun Liu, James T. Kwok, Yu Zhang
Abstract summary: We study how corrupted data affects Multimodal Large Language Models (MLLMs). We find that while corrupted data degrades the performance of MLLMs, its effects are largely superficial. We propose a corruption-robust training paradigm combining self-validation and post-training, which significantly outperforms existing corruption mitigation strategies.
Score: 85.58172296577506
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual Instruction Tuning (VIT) enhances Multimodal Large Language Models (MLLMs) but it is hindered by corrupted datasets containing hallucinated content, incorrect responses, and poor OCR quality. While prior works focus on dataset refinement through high-quality data collection or rule-based filtering, they are costly or limited to specific types of corruption. To deeply understand how corrupted data affects MLLMs, in this paper, we systematically investigate this issue and find that while corrupted data degrades the performance of MLLMs, its effects are largely superficial in that the performance of MLLMs can be largely restored by either disabling a small subset of parameters or post-training with a small amount of clean data. Additionally, corrupted MLLMs exhibit improved ability to distinguish clean samples from corrupted ones, enabling the dataset cleaning without external help. Based on those insights, we propose a corruption-robust training paradigm combining self-validation and post-training, which significantly outperforms existing corruption mitigation strategies.
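The self-validation step mentioned in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the abstract does not say how samples are scored or what fraction is kept, so `self_validate`, `score_fn`, and `keep_fraction` are hypothetical names, and the toy scores stand in for whatever a corruption-trained MLLM would assign to each sample.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def self_validate(
    samples: Sequence[Sample],
    score_fn: Callable[[Sample], float],
    keep_fraction: float = 0.5,
) -> Tuple[List[Sample], List[Sample]]:
    """Rank samples by score_fn (higher = more likely clean) and split them.

    Hypothetical helper: score_fn and keep_fraction are assumptions made for
    illustration, not details taken from the paper.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    # Sort from most to least trusted; sorted() is stable, so ties keep input order.
    ranked = sorted(samples, key=score_fn, reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep], ranked[n_keep:]


# Toy data: the "score" field is a placeholder for a model-assigned score.
toy_samples = [
    {"id": "a", "score": 0.91},
    {"id": "b", "score": 0.12},
    {"id": "c", "score": 0.78},
    {"id": "d", "score": 0.05},
]

kept, dropped = self_validate(toy_samples, lambda s: s["score"], keep_fraction=0.5)
kept_ids = [s["id"] for s in kept]
dropped_ids = [s["id"] for s in dropped]
```

Under these assumptions, the kept subset would play the role of the small clean set used for post-training, while the dropped subset is treated as likely corrupted.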

Related papers

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets [19.844836459291546]
High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or improper data integration across multiple sources. In this study, we investigate whether Large Language Models (LLMs) can help alleviate the burden of manual data cleaning.
arXiv Detail & Related papers (2025-03-09T15:29:46Z)
Are Large Language Models Good Data Preprocessors? [5.954202581988127]
High-quality textual training data is essential for the success of multimodal data processing tasks. Outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods.
arXiv Detail & Related papers (2025-02-24T02:57:21Z)
Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies [8.770864706004472]
Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects. We find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption.
arXiv Detail & Related papers (2024-12-24T09:04:06Z)
Dissecting Representation Misalignment in Contrastive Learning via Influence Function [15.28417468377201]
We introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models. Building upon ECIF, we develop a series of algorithms for data evaluation, misalignment detection, and misprediction trace-back tasks.
arXiv Detail & Related papers (2024-11-18T15:45:41Z)
The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection [23.378592856800168]
Large Language Models (LLMs) can be used to automate the annotation process. This study investigates whether LLMs are viable for annotating the complex task of media bias detection. We create annolexical, the first large-scale dataset for media bias classification.
arXiv Detail & Related papers (2024-11-17T14:14:36Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG [50.030526904378256]
Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents. To address this issue, we explore the task of "credibility-aware RAG". We introduce a plug-and-play method named Credibility-aware Attention Modification (CrAM). Experiments on Natural Questions and TriviaQA using Llama2-13B, Llama3-8B, and Qwen1.5-7B show that CrAM improves
arXiv Detail & Related papers (2024-06-17T13:01:12Z)
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
Purifying Large Language Models by Ensembling a Small Language Model [39.57304668057076]
We propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data. We empirically confirm the efficacy of ensembling LLMs with benign and small language models (SLMs).
arXiv Detail & Related papers (2024-02-19T14:00:39Z)
Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images. We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE). MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data. This process might suffer from privacy issues and violations of data protection regulations. We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.