Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
- URL: http://arxiv.org/abs/2406.15334v1
- Date: Fri, 21 Jun 2024 17:50:02 GMT
- Title: Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
- Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig
- Abstract summary: In-context learning with many examples can be promising for learning new tasks.
However, this many-shot setting is fundamentally limited by the model's context length, which is set at pretraining.
This motivates the need for a method to compress many shots into fewer tokens without finetuning.
- Score: 54.74986983905282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
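The abstract describes MTV only at a high level, as representations of in-context examples compressed in the model's attention heads. The following is a minimal toy sketch of that idea, not the authors' implementation. It assumes that the compression is a mean over per-head activations collected from several many-shot forward passes, and that a given set of (layer, head) positions is overwritten at inference. The function names, array shapes, and the choice of heads are all hypothetical, and random arrays stand in for real model activations.

```python
import numpy as np


def mean_head_activations(runs):
    """Average per-head activations over several forward passes.

    runs: list of arrays shaped (num_layers, num_heads, head_dim).
    Assumption: averaging is used as the compression step.
    """
    return np.mean(np.stack(runs, axis=0), axis=0)


def patch_heads(activations, task_vector, selected_heads):
    """Return a copy of `activations` in which each selected (layer, head)
    slot is overwritten by the matching slot of `task_vector`."""
    patched = activations.copy()
    for layer, head in selected_heads:
        patched[layer, head] = task_vector[layer, head]
    return patched


# Toy stand-ins for real activations (hypothetical sizes).
rng = np.random.default_rng(0)
num_layers, num_heads, head_dim = 2, 4, 8

# Five "many-shot" passes, each yielding one activation tensor.
runs = [rng.normal(size=(num_layers, num_heads, head_dim)) for _ in range(5)]
task_vector = mean_head_activations(runs)

# A "zero-shot" pass whose selected heads get replaced.
zero_shot = rng.normal(size=(num_layers, num_heads, head_dim))
selected = [(0, 1), (1, 3)]  # hypothetical head choice
patched = patch_heads(zero_shot, task_vector, selected)
```

In this sketch the patched pass carries no extra context tokens: the information from the shots lives only in the overwritten head slots, which mirrors the abstract's claim of no additional context length at inference.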
Related papers
- SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization [49.931663904599205]
Researchers have developed techniques for building Large Multimodal Models with In-Context Learning capabilities.
Existing LMMs fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns.
We propose Symbol Demonstration Direct Preference Optimization (SymDPO) to break the traditional paradigm of constructing multimodal demonstrations.
arXiv Detail & Related papers (2024-11-17T08:29:14Z)
- Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models [15.622219099903067]
We find that changing the order of multimodal input can cause the model's performance to fluctuate between advanced performance and random guessing.
This phenomenon exists in both single-modality (text-only or image-only) and mixed-modality (image-text-pair) contexts.
We propose a new metric, Position-Invariant Accuracy (PIA), to address order bias in MLLM evaluation.
arXiv Detail & Related papers (2024-10-22T13:05:11Z)
- What Makes Multimodal In-Context Learning Work? [58.48612721156335]
We present a framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models.
M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality.
We identify several biases and limitations of M-ICL that warrant consideration prior to deployment.
arXiv Detail & Related papers (2024-04-24T08:50:45Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$^2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as little as 50K multimodal data, M$^2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.