Investigating the Catastrophic Forgetting in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2309.10313v4
- Date: Tue, 5 Dec 2023 08:59:33 GMT
- Title: Investigating the Catastrophic Forgetting in Multimodal Large Language Models
- Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
- Abstract summary: We introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs.
Almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks.
As fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability.
- Score: 43.89009178021342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following the success of GPT4, there has been a surge in interest in
multimodal large language model (MLLM) research. This line of research focuses
on developing general-purpose LLMs through fine-tuning pre-trained LLMs and
vision models. However, catastrophic forgetting, a notorious phenomenon in
which a fine-tuned model fails to retain performance comparable to the
pre-trained model, remains an inherent problem in MLLMs.
In this paper, we introduce EMT (Evaluating MulTimodality), a framework for
evaluating catastrophic forgetting in MLLMs by treating each MLLM as an image
classifier. We first apply EMT to several open-source fine-tuned MLLMs and
discover that almost all of them fail to retain the performance of their
underlying vision encoders on standard image classification tasks.
Moreover, we continue fine-tuning LLaVA, an MLLM, and use EMT to assess its
performance throughout fine-tuning. Interestingly, our results suggest that
early-stage fine-tuning on an image dataset improves performance across other
image datasets by enhancing the alignment of text and visual features.
However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in
a significant loss of generalizability, even when the image encoder remains
frozen. Our results suggest that MLLMs have yet to demonstrate performance on
par with their vision models on standard image classification tasks, and that
the current MLLM fine-tuning procedure still has room for improvement.
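In code, the EMT protocol reduces to a short evaluation loop: prompt the MLLM to name the class of each test image, map the free-form answer back to a label, and compare the resulting accuracy against the zero-shot accuracy of the model's own vision encoder. Below is a minimal sketch; `mllm_answer` and `clip_zero_shot` are hypothetical stand-ins for a concrete model API, and the substring-matching rule is a simplification of the paper's answer parsing.

```python
# Minimal sketch of an EMT-style evaluation loop. `mllm_answer(image, prompt)`
# and `clip_zero_shot(...)` are hypothetical stand-ins, not the authors' code.

def emt_accuracy(mllm_answer, images, labels, class_names):
    """Treat the MLLM as an image classifier via prompting."""
    prompt = ("What is the object in this image? Answer with a single "
              f"class name from: {', '.join(class_names)}.")
    correct = 0
    for image, label in zip(images, labels):
        answer = mllm_answer(image, prompt).lower()
        # Simplified parsing: the prediction counts as correct only if the
        # ground-truth class name appears in the generated answer.
        if class_names[label].lower() in answer:
            correct += 1
    return correct / len(images)

def forgetting_gap(mllm_answer, clip_zero_shot, images, labels, class_names):
    """Accuracy gap between the MLLM and its frozen vision encoder."""
    mllm_acc = emt_accuracy(mllm_answer, images, labels, class_names)
    clip_acc = clip_zero_shot(images, labels, class_names)
    return clip_acc - mllm_acc  # a persistent positive gap signals forgetting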
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small one (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM; a generic sketch of such a distillation loss appears after this list.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [49.407311947143825]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
We also propose an innovative pre-training strategy, Endogenous Visual Pre-training (EViP), to maximize the visual capability of Mono-InternVL.
arXiv Detail & Related papers (2024-10-10T17:59:22Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data [3.08543976986593]
Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks.
This paper outlines and validates GenCeption, a novel, annotation-free evaluation method.
It requires only unimodal data to measure inter-modality semantic coherence and inversely assesses MLLMs' tendency to hallucinate.
arXiv Detail & Related papers (2024-02-22T21:22:04Z)
- Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance [56.04768229686853]
Large Vision-Language Models (LVLMs) tend to hallucinate non-existent objects in images.
We introduce a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE).
MARINE is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during generation; a generic sketch of classifier-free guidance at decoding time appears after this list.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
- The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs: images that are highly relevant to, but inconsistent with, the correct answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z)
- Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling [0.0]
Recent decoder-only large language models (LLMs) perform on par with smaller state-of-the-art encoders.
We explore techniques for improving the sequence labeling (SL) performance of open LLMs on information extraction (IE) tasks by applying layer-wise removal of the causal mask (CM); a minimal sketch of layer-wise mask removal appears after this list.
Our findings hold across diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong encoders and even instruction-tuned LLMs.
arXiv Detail & Related papers (2024-01-25T22:50:48Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art MLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
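As referenced in the LLaVA-KD entry above, the abstract only states that the student (s-MLLM) is trained to match the teacher's (l-MLLM) visual-textual output distributions. The temperature-scaled KL distillation loss below is a common instantiation of that idea, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student
    next-token distributions (a common distillation objective)."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 so gradient magnitudes stay comparable
    # across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Example: logits over a 32k-token vocabulary for a batch of 4 positions.
loss = distillation_loss(torch.randn(4, 32000), torch.randn(4, 32000))
```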
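The MARINE entry describes training-free, API-free hallucination reduction via classifier-free guidance. The textbook logit-space form of classifier-free guidance is sketched below, with `cond_logits` computed with the extra visual-grounding signal and `uncond_logits` without it; MARINE's exact formulation may differ.

```python
import torch

def cfg_logits(cond_logits, uncond_logits, guidance_scale=1.5):
    """Classifier-free guidance at decoding time: amplify the shift that
    conditioning (e.g., a visual-grounding signal) induces on the
    next-token distribution."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```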
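Finally, for layer-wise causal-mask removal in the sequence-labeling entry: the abstract says the causal mask is removed layer-wise, without specifying which layers. The sketch below assumes removal in all layers at or above a chosen index; which layers are actually unmasked is the paper's empirical question.

```python
import torch

def attention_bias(seq_len, layer_idx, unmask_from_layer):
    """Additive attention bias: causal in early layers, fully bidirectional
    in layers at or above `unmask_from_layer` (layer-wise CM removal)."""
    if layer_idx >= unmask_from_layer:
        return torch.zeros(seq_len, seq_len)  # attend to all positions
    # standard causal bias: -inf strictly above the diagonal
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
```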