Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning
- URL: http://arxiv.org/abs/2511.08246v1
- Date: Wed, 12 Nov 2025 01:48:29 GMT
- Title: Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning
- Authors: Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai
- Abstract summary: We propose a Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert.
- Score: 57.082554323521464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
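For intuition only, here is a minimal Python sketch of the pipeline the abstract outlines: score candidate locations by how large and consistent the activation deltas are across query-context pairs, then cluster the activations observed at each selected location into a small bank. All names, shapes, and the scoring heuristic are assumptions rather than the authors' code; the paper's reinforcement-learning selection step is only noted in a comment.

```python
# Illustrative sketch of the STV idea (hypothetical, not the authors' code).
import numpy as np
from sklearn.cluster import KMeans

def find_sensitive_locations(deltas: np.ndarray, top_k: int = 16) -> np.ndarray:
    """deltas: (pairs, locations, hidden) activation differences between
    runs with and without in-context demonstrations. Locations whose deltas
    are large and consistent across query-context pairs are kept
    (assumed heuristic)."""
    norms = np.linalg.norm(deltas, axis=-1)              # (pairs, locations)
    score = norms.mean(axis=0) / (norms.std(axis=0) + 1e-6)
    return np.argsort(score)[-top_k:]

def build_activation_bank(acts: np.ndarray, locations: np.ndarray, k: int = 8) -> dict:
    """Cluster the observed activations at each sensitive location into k
    candidate vectors; at inference a policy (trained with RL in the paper)
    would pick one candidate per location to insert into the forward pass."""
    return {
        int(loc): KMeans(n_clusters=k, n_init=10).fit(acts[:, loc, :]).cluster_centers_
        for loc in locations
    }

# Toy usage with random stand-in activations.
rng = np.random.default_rng(0)
deltas = rng.normal(size=(32, 100, 64))                  # 32 pairs, 100 locations
locs = find_sensitive_locations(deltas, top_k=4)
bank = build_activation_bank(rng.normal(size=(32, 100, 64)), locs)
```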
Related papers
- Tracking and Segmenting Anything in Any Modality [75.32774085793498]
We propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
arXiv Detail & Related papers (2025-11-22T09:09:22Z) - Training-Free Multimodal Deepfake Detection via Graph Reasoning [16.774618707890834]
Multimodal deepfake detection (MDD) aims to uncover manipulations across visual, textual, and auditory modalities. We propose Guided Adaptive Scorer and Propagation In-Context Learning (GASP-ICL), a training-free framework for MDD.
arXiv Detail & Related papers (2025-09-26T02:22:12Z) - Towards Agentic AI for Multimodal-Guided Video Object Segmentation [14.877182670778284]
Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. We propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner.
arXiv Detail & Related papers (2025-08-14T12:11:15Z) - Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition [33.22316608406554]
We propose a parameter-efficient dual adaptation method for both image and text encoders. Specifically, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. We develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions.
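For flavor, a generic parameter-efficient adapter of the kind this entry alludes to; this is a hypothetical bottleneck module illustrating the general recipe, not Task-Adapter++'s actual design.

```python
# Hypothetical bottleneck adapter; illustrates the generic
# parameter-efficient recipe, not Task-Adapter++'s actual modules.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)   # few trainable params
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's signal intact.
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter(dim=768)
features = adapter(torch.randn(2, 197, 768))  # (batch, tokens, dim)
```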
arXiv Detail & Related papers (2025-05-09T12:34:10Z) - A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection into a Visual Question Answering (VQA) task.
We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap.
arXiv Detail & Related papers (2024-10-01T08:16:40Z) - Learning Multi-Aspect Item Palette: A Semantic Tokenization Framework for Generative Recommendation [55.99632509895994]
We introduce LAMIA, a novel approach for multi-aspect semantic tokenization. Unlike RQ-VAE, which uses a single embedding, LAMIA learns an "item palette": a collection of independent and semantically parallel embeddings. Our results demonstrate significant improvements in recommendation accuracy over existing methods.
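A toy contrast with single-embedding tokenization, assuming a fixed number of aspects per item; the indexing scheme and shapes are illustrative, not LAMIA's implementation.

```python
# Hypothetical 'item palette': K independent aspect embeddings per item,
# versus one vector per item. Not LAMIA's actual design.
import torch
import torch.nn as nn

class ItemPalette(nn.Module):
    def __init__(self, num_items: int, num_aspects: int, dim: int):
        super().__init__()
        self.palette = nn.Embedding(num_items * num_aspects, dim)
        self.num_aspects = num_aspects

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # (batch,) -> (batch, num_aspects, dim): one row per semantic aspect.
        base = item_ids.unsqueeze(-1) * self.num_aspects
        idx = base + torch.arange(self.num_aspects, device=item_ids.device)
        return self.palette(idx)

palette = ItemPalette(num_items=1000, num_aspects=4, dim=64)
vecs = palette(torch.tensor([3, 17]))   # shape (2, 4, 64)
```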
arXiv Detail & Related papers (2024-09-11T13:49:48Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
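To make the vector-field idea concrete, a hypothetical single head that emits one vector per spatial location, from which different tasks could decode their targets; RepVF's actual field definition and per-task decoding differ.

```python
# Toy single-head producing a per-location vector field; purely
# illustrative of the shared-representation idea, not RepVF itself.
import torch
import torch.nn as nn

class VectorFieldHead(nn.Module):
    def __init__(self, in_dim: int, field_dim: int = 2):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, field_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (batch, field_dim, H, W): one vector per spatial location,
        # shared by all downstream tasks.
        return self.proj(feats)

head = VectorFieldHead(in_dim=256)
field = head(torch.randn(1, 256, 32, 32))  # (1, 2, 32, 32)
```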
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [40.972648044298374]
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks.
However, they often lack interpretability and struggle with complex visual inputs.
We introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs.
We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts.
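A schematic of such a multi-turn loop, where `model` is any callable taking an image and a prompt; the prompts and the two-turn structure are invented for illustration, not the paper's pipeline.

```python
def visual_cot_answer(model, image, question: str, max_turns: int = 2) -> str:
    """Multi-turn loop: first elicit an intermediate visual 'thought',
    then answer conditioned on it. Hypothetical prompts, not the paper's."""
    context = question
    for _ in range(max_turns - 1):
        thought = model(image, f"Which image region matters for: {context}?")
        context = f"{question}\nRelevant context: {thought}"
    return model(image, context)

# Stand-in model for demonstration purposes.
answer = visual_cot_answer(lambda img, prompt: f"[reply to: {prompt[:40]}...]",
                           image=None, question="What is the sign's color?")
```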
arXiv Detail & Related papers (2024-03-25T17:59:23Z) - Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
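A sketch of conditioning features on a task through a low-rank (decomposed) task-specific matrix, loosely in the spirit of the Mix Task Adapter; the module layout here is an assumption, not TIT's implementation.

```python
# Hypothetical low-rank task-indicating adapter; illustrative only.
import torch
import torch.nn as nn

class TaskIndicatingAdapter(nn.Module):
    def __init__(self, num_tasks: int, dim: int, rank: int = 4):
        super().__init__()
        # Per-task matrix stored in decomposed (low-rank) form.
        self.A = nn.Parameter(torch.randn(num_tasks, dim, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(num_tasks, rank, dim) * 0.02)

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        W = self.A[task_id] @ self.B[task_id]   # (dim, dim) task-indicating matrix
        return x + x @ W                        # residual task-conditional update

adapter = TaskIndicatingAdapter(num_tasks=3, dim=256)
y = adapter(torch.randn(8, 196, 256), task_id=1)
```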
arXiv Detail & Related papers (2024-03-01T07:06:57Z) - Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework in which multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful on classification tasks with little or non-overlapping annotations.
We propose a novel approach in which knowledge exchange between the tasks is enabled via distribution matching.
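One common instantiation of distribution matching is a symmetric KL between the two tasks' softened predictions, assuming both heads share a label space; whether this matches the paper's exact objective is an assumption.

```python
# Symmetric KL between softened predictions; assumes both task heads
# predict over a shared label space. Hypothetical stand-in objective.
import torch
import torch.nn.functional as F

def distribution_matching_loss(logits_a: torch.Tensor,
                               logits_b: torch.Tensor,
                               temperature: float = 2.0) -> torch.Tensor:
    p = F.log_softmax(logits_a / temperature, dim=-1)
    q = F.log_softmax(logits_b / temperature, dim=-1)
    # Symmetric KL so knowledge flows in both directions.
    kl_pq = F.kl_div(p, q, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(q, p, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

loss = distribution_matching_loss(torch.randn(8, 10), torch.randn(8, 10))
```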
arXiv Detail & Related papers (2024-01-02T14:18:11Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
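An illustrative take on learnable queries that aggregate global cues within one modality via cross-attention; names and dimensions are assumptions, not the paper's IMQ definition.

```python
# Hypothetical learnable-query aggregator, illustrative of the IMQ idea.
import torch
import torch.nn as nn

class ImplicitQueryAggregator(nn.Module):
    def __init__(self, dim: int, num_queries: int = 4, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) from a single modality encoder.
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)   # queries attend over all tokens
        return out                              # (batch, num_queries, dim)

agg = ImplicitQueryAggregator(dim=256)
cues = agg(torch.randn(2, 50, 256))  # (2, 4, 256)
```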
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.