Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?
- URL: http://arxiv.org/abs/2311.18021v2
- Date: Sat, 07 Dec 2024 15:34:23 GMT
- Title: Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?
- Authors: Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, Volker Tresp, Jindong Gu
- Abstract summary: Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos).
Recently, Multimodal Large Language Models (MLLMs) have also shown multimodal ICL ability, responding to queries given a few multimodal demos, including images, queries, and answers.
- Score: 42.03008819332293
- Abstract: Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, research on ICL in MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or whether only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) demo content, i.e., understanding the influence of demo content in different modalities; and 2) demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments reveal that multimodal ICL is predominantly driven by textual content, whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both the visual and language modalities when selecting demos. Extensive experiments support our findings and verify the improvement brought by our method. Code is available at https://chenxshuo.github.io/m-icl/.
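A minimal sketch of what a mixed-modality demo selection step in the spirit of MMICES could look like, assuming precomputed image and text embeddings for the query and the candidate pool. The function names, the two-stage order (visual pre-filtering followed by textual ranking), and the pool sizes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_sim(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of candidate vectors."""
    query = query / (np.linalg.norm(query) + 1e-8)
    candidates = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    return candidates @ query

def select_demos_mixed_modality(
    query_img_emb: np.ndarray,   # (d,) image embedding of the test query
    query_txt_emb: np.ndarray,   # (d,) text embedding of the test query
    cand_img_embs: np.ndarray,   # (M, d) image embeddings of the candidate demos
    cand_txt_embs: np.ndarray,   # (M, d) text embeddings of the candidate demos
    k_visual: int = 32,          # size of the visually pre-filtered pool (assumed hyperparameter)
    n_demos: int = 4,            # number of in-context demos to return
) -> list[int]:
    """Pick demos using both modalities: filter by image similarity, then rank by text."""
    # Stage 1: keep the k_visual candidates whose images are closest to the query image.
    vis_scores = cosine_sim(query_img_emb, cand_img_embs)
    visual_pool = np.argsort(-vis_scores)[:k_visual]

    # Stage 2: within that pool, rank by textual similarity and keep the top n_demos.
    txt_scores = cosine_sim(query_txt_emb, cand_txt_embs[visual_pool])
    top_in_pool = np.argsort(-txt_scores)[:n_demos]
    return [int(visual_pool[i]) for i in top_in_pool]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, M = 512, 1000  # toy embedding dimension and candidate-pool size
    chosen = select_demos_mixed_modality(
        rng.normal(size=d), rng.normal(size=d),
        rng.normal(size=(M, d)), rng.normal(size=(M, d)),
    )
    print("selected demo indices:", chosen)
```

In practice the embeddings would come from the MLLM's own vision and text encoders (or an off-the-shelf model such as CLIP), and the selected demos would be concatenated with the test query to form the multimodal prompt.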
Related papers
- Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models [15.622219099903067]
We find that changing the order of the multimodal input can cause a model's performance to fluctuate between strong results and random guessing.
This phenomenon exists in both single-modality (text-only or image-only) and mixed-modality (image-text-pair) contexts.
We propose a new metric, Position-Invariant Accuracy (PIA), to address order bias in MLLM evaluation.
arXiv Detail & Related papers (2024-10-22T13:05:11Z) - AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning [15.770849688170477]
In-context learning (ICL) enables Large Language Models to exhibit emergent abilities on downstream tasks without updating billions of parameters.
Most mainstream MLLMs are trained only on single-image datasets, leaving them unable to read multi-modal demonstrations.
We propose a general and lightweight framework, AIM, which tackles these problems by Aggregating the Image information of Multimodal demonstrations into the dense latent space of the corresponding linguistic part.
arXiv Detail & Related papers (2024-06-11T08:12:43Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - What Makes Multimodal In-Context Learning Work? [58.48612721156335]
We present a framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models.
M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality.
We identify several biases and limitations of M-ICL that warrant consideration prior to deployment.
arXiv Detail & Related papers (2024-04-24T08:50:45Z) - MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
Vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity.
We introduce the vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows VLMs to handle multi-modal inputs efficiently.
Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks.
arXiv Detail & Related papers (2023-09-14T17:59:17Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a new research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z) - mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning images and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.