Related papers: MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

URL: http://arxiv.org/abs/2312.06363v2
Date: Tue, 12 Dec 2023 06:53:27 GMT
Title: MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples
Authors: Tao Chen, Enwei Zhang, Yuting Gao, Ke Li, Xing Sun, Yan Zhang and Hui Li
Abstract summary: Multi-Modal In-Context Tuning (MMICT) is a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning. We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
Score: 30.284100018891397
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although In-Context Learning (ICL) brings remarkable performance gains to Large Language Models (LLMs), the improvements remain lower than fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms traditional fine-tuning strategy and the vanilla ICT method that directly takes the concatenation of all information from different modalities as input.

Related papers

True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks.<n>Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation.<n>We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z)
LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities. PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains. In this work, we introduce a novel Multimodal Prompt Tuning (M$2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
What Makes Multimodal In-Context Learning Work? [58.48612721156335]
We present a framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. We identify several biases and limitations of M-ICL that warrant consideration prior to deployment.
arXiv Detail & Related papers (2024-04-24T08:50:45Z)
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data. We present a novel alignment strategy that employs multimodal AI system to oversee itself called Reinforcement Learning from AI Feedback (RLAIF) In specific, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models. When tuned on as little as 50K multimodal data, M$2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [33.072967177313025]
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B) We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
arXiv Detail & Related papers (2023-09-27T22:50:51Z)
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.