TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
- URL: http://arxiv.org/abs/2505.17098v1
- Date: Wed, 21 May 2025 05:22:21 GMT
- Title: TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
- Authors: Yanshu Li, Tian Yun, Jianjiang Yang, Pinyuan Feng, Jinfa Huang, Ruixiang Tang,
- Abstract summary: Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). We present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks.
- Score: 11.724886737930671
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major obstacle is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.
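The abstract describes TACO as configuring in-context sequences with task-aware attention and injecting task-mapping signals into autoregressive decoding. The authors' implementation is not reproduced here; the Python sketch below only illustrates, under stated assumptions, what task-aware scoring of candidate demonstrations could look like. All names (`select_demonstrations`, `task_emb`) are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of task-aware demonstration scoring (not the authors' code).
import torch
import torch.nn.functional as F

def select_demonstrations(query_emb: torch.Tensor,
                          cand_embs: torch.Tensor,
                          task_emb: torch.Tensor,
                          k: int = 4) -> torch.Tensor:
    """Score candidate demonstrations against a query, biased by a task-mapping signal.

    query_emb: (d,)   embedding of the test query
    cand_embs: (n, d) embeddings of candidate demonstrations
    task_emb:  (d,)   learned embedding summarizing the inferred task mapping
    Returns indices of the k selected demonstrations, in the order they would be placed.
    """
    # Inject the task-mapping signal into the query before attention-style scoring.
    task_aware_query = query_emb + task_emb
    # Scaled dot-product scores over all candidates.
    scores = cand_embs @ task_aware_query / cand_embs.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)
    # Keep the k highest-weighted demonstrations, placing the strongest match
    # last so it sits closest to the query in the in-context sequence.
    topk = torch.topk(weights, k).indices
    return topk.flip(0)

# Toy usage with random embeddings.
torch.manual_seed(0)
query = torch.randn(256)
candidates = torch.randn(100, 256)
task = torch.randn(256)
print(select_demonstrations(query, candidates, task, k=4))
```

In a full system, the task embedding would presumably be produced by the configurator itself during decoding rather than supplied externally, and the selected demonstrations would then be interleaved with the query before being passed to the LVLM.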
Related papers
- Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications [42.945657927971]
This paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear.
arXiv Detail & Related papers (2025-07-28T09:33:12Z) - True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL), i.e., adapting to new tasks from a few multimodal demonstrations. Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z) - Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - Enhancing Cross-task Transfer of Large Language Models via Activation Steering [75.41750053623298]
Cross-task in-context learning offers a direct solution for transferring knowledge across tasks. We investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. We propose a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states (a minimal sketch of activation steering appears after this list).
arXiv Detail & Related papers (2025-07-17T15:47:22Z) - Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. However, tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z) - Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations [0.0]
Multimodal in-context learning (ICL) has emerged as a key capability of Large Vision-Language Models (LVLMs). We shed light on the core mechanism underlying multimodal ICL, identifying task mapping as a crucial factor in configuring robust in-context demonstration sequences. We propose SabER, a lightweight yet powerful decoder-only transformer equipped with task-aware attention.
arXiv Detail & Related papers (2025-03-05T16:33:10Z) - Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - AgentPS: Agentic Process Supervision for Multi-modal Content Quality Assurance through Multi-round QA [9.450927573476822]
AgentPS is a novel framework that integrates Agentic Process Supervision into MLLMs via multi-round question answering during fine-tuning. AgentPS demonstrates significant performance improvements over baseline MLLMs on proprietary TikTok datasets.
arXiv Detail & Related papers (2024-12-15T04:58:00Z) - TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.5952731807559]
We present TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA) and executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator. The synthetic CoTA training data enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction-tuning data with only direct answers.
arXiv Detail & Related papers (2024-12-07T00:42:04Z) - Multimodal Contrastive In-Context Learning [0.9120312014267044]
This paper introduces a novel multimodal contrastive in-context learning framework to enhance our understanding of gradient-free in-context learning (ICL) in Large Language Models (LLMs).
First, we present a contrastive learning-based interpretation of ICL in real-world settings, marking the distance of the key-value representation as the differentiator in ICL.
Second, we develop an analytical framework to address biases in multimodal input formatting for real-world datasets.
Third, we propose an on-the-fly approach for ICL that demonstrates effectiveness in detecting hateful memes.
arXiv Detail & Related papers (2024-08-23T10:10:01Z) - Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
arXiv Detail & Related papers (2024-03-01T07:06:57Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
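As noted in the entry on cross-task activation steering above, transfer can be attempted by manipulating a model's internal activation states. The sketch below is a generic, hypothetical illustration of activation steering via a forward hook on GPT-2 using Hugging Face transformers; the steering vector, layer index, and model choice are assumptions made for illustration and are not tied to the cited framework.

```python
# Hypothetical sketch of latent-space activation steering (not the cited authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A steering vector would normally be extracted from source-task activations;
# here it is random purely to keep the sketch self-contained.
hidden_size = model.config.hidden_size
steering_vector = torch.randn(hidden_size) * 0.1
layer_idx = 6  # assumed injection layer

def add_steering(module, inputs, output):
    # GPT-2 transformer blocks return a tuple; hidden states come first.
    hidden = output[0] + steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

# Attach the hook so every forward pass through the chosen block is steered.
handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

inputs = tok("Translate to French: cheese ->", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later calls run unsteered
```

In practice, the steering vector would be derived from activations collected on a source task (for example, a mean difference between task-specific and baseline hidden states) rather than sampled at random.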