Investigating the Catastrophic Forgetting in Multimodal Large Language
Models
- URL: http://arxiv.org/abs/2309.10313v4
- Date: Tue, 5 Dec 2023 08:59:33 GMT
- Title: Investigating the Catastrophic Forgetting in Multimodal Large Language
Models
- Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee,
Yi Ma
- Abstract summary: We introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs.
Almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks.
As fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability.
- Score: 43.89009178021342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following the success of GPT-4, there has been a surge in interest in
multimodal large language model (MLLM) research. This line of research focuses
on developing general-purpose MLLMs by fine-tuning pre-trained LLMs and vision
models. However, catastrophic forgetting, a notorious phenomenon in which a
fine-tuned model fails to retain performance comparable to the pre-trained
model, remains an inherent problem in MLLMs.
In this paper, we introduce EMT (Evaluating MulTimodality), a framework for
evaluating catastrophic forgetting in MLLMs by treating each MLLM as an image
classifier.
We first apply EMT to several open-source fine-tuned MLLMs and discover that
almost all of them fail to retain the performance of their underlying vision
encoders on standard image classification tasks.
Moreover, we continue fine-tuning LLaVA, an MLLM, and use EMT to assess its
performance throughout fine-tuning. Interestingly, our results suggest that
early-stage fine-tuning on an image dataset improves performance across other
image datasets, by enhancing the alignment of text and visual features.
However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in
a significant loss of generalizability, even when the image encoder remains
frozen. Our results suggest that MLLMs have yet to demonstrate performance on
par with their vision models on standard image classification tasks, and that
the current MLLM fine-tuning procedure still has room for improvement.
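Concretely, the EMT protocol reduces to a prompt-then-parse loop over a labeled image dataset. The Python sketch below is a minimal illustration of that idea, not the authors' released code; `mllm_generate`, the prompt wording, and the answer-parsing rule are all assumptions.

```python
# Minimal sketch of an EMT-style evaluation: the MLLM is used as a zero-shot
# image classifier by listing the candidate labels in the prompt and parsing
# the generated answer. `mllm_generate(image, prompt) -> str` is a hypothetical
# wrapper around whichever MLLM (e.g. LLaVA) is being evaluated.
from typing import Any, Callable, Iterable, Tuple


def emt_accuracy(
    mllm_generate: Callable[[Any, str], str],
    dataset: Iterable[Tuple[Any, str]],  # (image, ground-truth class name) pairs
    class_names: list,
) -> float:
    prompt = (
        "What is the main object in this image? "
        f"Answer with exactly one of: {', '.join(class_names)}."
    )
    correct = total = 0
    for image, label in dataset:
        answer = mllm_generate(image, prompt).lower()
        # Count a hit only if the true class name appears in the answer
        # and no other candidate class name does.
        matched = [c.lower() for c in class_names if c.lower() in answer]
        correct += int(matched == [label.lower()])
        total += 1
    return correct / max(total, 1)
```

Accuracy computed this way can then be compared against the classification accuracy of the frozen vision encoder on the same dataset, which is the comparison the abstract describes.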
Related papers
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
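As a rough, hedged illustration of what distilling into hidden representations can look like (not OLA-VLM's actual objective; the dimensions, pooling, and loss weight below are assumptions), an auxiliary term can pull a projection of intermediate LLM states toward target visual embeddings alongside the usual language loss:

```python
# Sketch of an auxiliary embedding-distillation loss: project pooled LLM hidden
# states, match them to a target visual representation, and add the term to the
# ordinary next-token loss. All shapes and the 0.5 weight are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, target_dim, batch, seq = 4096, 1024, 2, 16
proj = nn.Linear(hidden_dim, target_dim)

llm_hidden = torch.randn(batch, seq, hidden_dim)  # intermediate LLM layer output
target_visual = torch.randn(batch, target_dim)    # pooled target visual embedding

pooled = llm_hidden.mean(dim=1)
distill_loss = 1.0 - F.cosine_similarity(proj(pooled), target_visual, dim=-1).mean()

lm_loss = torch.tensor(2.3)               # stand-in for the usual cross-entropy loss
total_loss = lm_loss + 0.5 * distill_loss
```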
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- A Comprehensive Evaluation of Large Language Models on Aspect-Based Sentiment Analysis [26.505386645322506]
Large Language Models (LLMs) have garnered increasing attention in the field of natural language processing.
In this paper, we shed light on a comprehensive evaluation of LLMs in the ABSA field, involving 13 datasets, 8 ABSA subtasks, and 6 LLMs.
Our experiments demonstrate that, in the fine-tuning-dependent paradigm, LLMs achieve new state-of-the-art performance compared with fine-tuned Small Language Models (SLMs).
arXiv Detail & Related papers (2024-12-03T08:54:17Z)
- Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning [104.27224674122313]
Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks.
To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions.
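One common proxy for parameter importance over a data distribution is a diagonal-Fisher-style average of squared gradients; the sketch below uses that proxy purely for illustration and is not necessarily the measure proposed in the paper.

```python
# Hedged sketch: per-parameter importance as the mean squared gradient of the
# loss over batches drawn from a given distribution (a diagonal Fisher
# approximation). Run once on pre-training-like data and once on fine-tuning
# data to compare the two importance maps.
import torch


def estimate_importance(model, data_loader, loss_fn, num_batches=10):
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.detach() ** 2 / num_batches
    return importance
```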
arXiv Detail & Related papers (2024-11-17T01:16:37Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
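A minimal way to match visual-textual output distributions is a temperature-scaled KL divergence between teacher (l-MLLM) and student (s-MLLM) token distributions; the temperature, shapes, and reduction below are assumptions rather than the paper's exact MDist objective.

```python
# Hedged sketch of output-distribution distillation from a large teacher MLLM
# to a small student: KL divergence over the vocabulary at every position,
# with the usual temperature scaling. Shapes and T are illustrative only.
import torch
import torch.nn.functional as F

batch, seq, vocab, T = 2, 8, 32000, 2.0
teacher_logits = torch.randn(batch, seq, vocab)  # frozen l-MLLM
student_logits = torch.randn(batch, seq, vocab)  # s-MLLM being trained

kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.log_softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
    log_target=True,
) * (T * T)
```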
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data [3.08543976986593]
Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks.
This paper outlines and validates GenCeption, a novel, annotation-free evaluation method.
It requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs' tendency to hallucinate.
arXiv Detail & Related papers (2024-02-22T21:22:04Z)
- Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existing objects in the images.
We introduce a framework called Mitigating hallucinAtion via classifieR-free guIdaNcE (MARINE).
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process.
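For orientation, the generic classifier-free guidance recipe at decoding time combines next-token logits from a guided (visually grounded) pass and an unguided pass with a guidance weight; the sketch below shows only that generic combination, not MARINE's specific guidance signal.

```python
# Hedged sketch of classifier-free guidance on next-token logits: move the
# prediction toward the guided distribution and away from the unguided one.
# gamma is a tunable guidance strength; MARINE's actual guidance may differ.
import torch

vocab = 32000
logits_guided = torch.randn(vocab)    # pass conditioned on the guidance signal
logits_unguided = torch.randn(vocab)  # pass without the guidance signal

gamma = 1.5
combined = logits_unguided + gamma * (logits_guided - logits_unguided)
next_token = torch.argmax(combined).item()
```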
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
- The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs, consisting of images that are highly relevant to, but inconsistent with, the answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art MLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
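The title suggests a CLIP-style contrastive objective between image features and text features read out from a prompt-tuned LLM; the sketch below shows only a generic symmetric InfoNCE loss under assumed shapes, not CLAMP's actual training recipe.

```python
# Hedged sketch of a symmetric contrastive (InfoNCE) loss between image
# embeddings and label embeddings produced by a language model. How CLAMP
# extracts and tunes these embeddings is specific to the paper.
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 512, 0.07
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)  # vision encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # LLM read-out per label

logits = image_emb @ text_emb.t() / temperature
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```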
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.