Test-Time Computing for Referring Multimodal Large Language Models
- URL: http://arxiv.org/abs/2602.19505v1
- Date: Mon, 23 Feb 2026 04:42:10 GMT
- Title: Test-Time Computing for Referring Multimodal Large Language Models
- Authors: Mingrui Wu, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Zhiyuan Liu, Liujuan Cao, Ming-Ming Cheng, Rongrong Ji,
- Abstract summary: We propose ControlMLLM++, a novel test-time adaptation framework.<n>It injects learnable visual prompts into frozen multimodal large language models.
- Score: 143.49848714354698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
Related papers
- Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification [22.871255950998016]
We introduce a novel framework for inference-time visual tokens scaling that enables MLLMs to perform verifier-guided reasoning over visual content.<n>Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks.<n>These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
arXiv Detail & Related papers (2025-06-08T17:38:49Z) - SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization [49.931663904599205]
Researchers have developed techniques to develop Large Multimodal Models with In-Context Learning capabilities.
Existing LMMs fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns.
We propose Symbol Demonstration Direct Preference Optimization (SymDPO) to break the traditional paradigm of constructing multimodal demonstrations.
arXiv Detail & Related papers (2024-11-17T08:29:14Z) - EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Models (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training efforts than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension with the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs)<n>We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map.<n>Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models [70.25499865569353]
We introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert.
Our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.
arXiv Detail & Related papers (2024-03-20T09:42:43Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.