Related papers: CROME: Cross-Modal Adapters for Efficient Multimodal LLM

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

URL: http://arxiv.org/abs/2408.06610v1
Date: Tue, 13 Aug 2024 03:45:11 GMT
Title: CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Authors: Sayna Ebrahimi, Sercan O. Arik, Tejas Nama, Tomas Pfister,
Abstract summary: Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities. Existing approaches often necessitate expensive language model retraining and limited adaptability. We propose CROME, an efficient vision-language instruction tuning framework.
Score: 28.337072921099494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often necessitate expensive language model retraining and limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations prior to input into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it yields fine-tuning with exceptional parameter efficiency, competing with task-specific specialist state-of-the-art methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.

Related papers

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions.<n>We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs.<n>Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models. Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains. In this work, we introduce a novel Multimodal Prompt Tuning (M$2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. Training adapters with image-level supervision often results in significant misalignment. We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
arXiv Detail & Related papers (2024-08-21T17:58:02Z)
Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning [0.0]
The integration of large language models (LLMs) with vision-language (VL) tasks has been a transformative development in the realm of artificial intelligence. We present a novel approach, termed Bottleneck Adapter, specifically crafted for enhancing the multimodal functionalities of these complex models. Our approach utilizes lightweight adapters to connect the image encoder and LLM without the need for large, complex neural networks.
arXiv Detail & Related papers (2024-07-25T06:59:15Z)
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.