Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering
Abstract Overview
This paper addresses multi-region visual referring in large multimodal models (LMMs), where multiple marked regions must be interpreted jointly and the answer sometimes depends on global scene context. The authors propose Contextual Latent Steering (CSteer), a training-free method that pre-computes contextual steering vectors from contrastive examples and applies representation editing at inference time without fine-tuning or architectural modifications. The vectors encode behaviors useful for referring, such as distinguishing among multiple marked regions and incorporating broader contextual cues. Experiments evaluate the method on the GAR-Bench, INST-IT, VIP-Bench, and BLINK benchmarks, with ablations on vector construction, layer selection, steering decomposition, and data scale.
Novelty
The key contribution is a training-free way to improve multi-region visual referring in general LMMs, with no added region encoder, fine-tuning, or architectural modification. The approach builds steering vectors by pairing incorrect model rollouts with referential rewrites corrected by an LLM judge, then applies decomposed steering in two parts: to query tokens at early layers, and to marker tokens during decoding at middle-to-late layers.
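The vector construction described above can be illustrated with a minimal sketch. It assumes the common mean-difference formulation for steering vectors (mean hidden state of the corrected rewrites minus mean hidden state of the incorrect rollouts); the summary does not state the exact formula used by CSteer, and the function names and toy hidden states below are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def contrastive_steering_vector(corrected: List[Vector],
                                incorrect: List[Vector]) -> Vector:
    """Mean-difference steering vector (an assumed formulation).

    corrected: hidden states taken from LLM-corrected rewrites.
    incorrect: hidden states taken from the paired incorrect rollouts.
    """
    if len(corrected) != len(incorrect):
        raise ValueError("expected one corrected rewrite per incorrect rollout")
    pos = mean_vector(corrected)
    neg = mean_vector(incorrect)
    return [p - q for p, q in zip(pos, neg)]


# Toy 3-dimensional hidden states for two contrastive pairs (made-up values).
incorrect_states = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
corrected_states = [[1.0, 1.0, 4.0], [3.0, 3.0, 2.0]]

steering_vector = contrastive_steering_vector(corrected_states, incorrect_states)
print(steering_vector)
```

In a real setting the hidden states would be read from a chosen layer of the LMM rather than written by hand, and the resulting vector would be stored once and reused at inference time.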
Results
CSteer consistently improves strong general LMMs over a Set-of-Mark prompting baseline across multiple benchmarks. On Qwen3-VL-8B, it raises INST-IT image open-ended performance from 78.5 to 80.4 and INST-IT video multiple-choice performance from 58.2 to 60.1, GAR-Bench open-ended (OE) performance from 52.5 to 57.4, the VIP-Bench average from 71.5 to 74.7, and BLINK from 55.9 to 57.5. Ablations confirm that rewrite-based vector construction and decomposed steering are the most effective design choices.
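For a quick side-by-side view, the absolute gains implied by the Qwen3-VL-8B numbers above can be computed directly. The scores are copied from this summary; the dictionary layout and variable names are illustrative only.

```python
# (baseline, with CSteer) scores for Qwen3-VL-8B, as reported in this summary.
reported = {
    "INST-IT image open-ended": (78.5, 80.4),
    "INST-IT video multiple-choice": (58.2, 60.1),
    "GAR-Bench open-ended": (52.5, 57.4),
    "VIP-Bench average": (71.5, 74.7),
    "BLINK": (55.9, 57.5),
}

# Absolute gain in points, rounded to one decimal to avoid float noise.
gains = {
    name: round(after - before, 1)
    for name, (before, after) in reported.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

The gains range from 1.6 points (BLINK) to 4.9 points (GAR-Bench open-ended). Since the benchmarks use different metrics, these differences are best read per benchmark rather than compared across them.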
Key Points
- CSteer targets a specific weakness of general LMMs: referring to multiple marked regions simultaneously, especially when correct answers depend on contextual scene understanding rather than isolated object recognition.
- The method derives contextual steering vectors from contrastive hidden-state differences, with the strongest variant using false rollouts paired with LLM-corrected rewrites to capture referential corrections.
- Ablations show that early-layer in-query steering and mid-to-late-layer decoding steering play complementary roles, and that gains are consistent across data scales (32 to 1024 samples), input domains (images and videos), and prompting formats (points, boxes, numerical identifiers).
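The decomposed steering summarized in the points above can be sketched as a per-layer edit of hidden states. This is a simplified illustration, not the authors' implementation: it assumes additive steering with a scalar strength `alpha`, and the layer sets, token positions, vectors, and function names are all hypothetical.

```python
from typing import List, Sequence, Set

Vector = List[float]


def steer_positions(hidden: List[Vector], positions: Sequence[int],
                    vector: Vector, alpha: float) -> List[Vector]:
    """Return a copy of `hidden` with alpha * vector added at `positions`."""
    targets = set(positions)
    return [
        [h + alpha * v for h, v in zip(row, vector)] if idx in targets else list(row)
        for idx, row in enumerate(hidden)
    ]


def decomposed_steer(hidden: List[Vector], layer: int,
                     query_positions: Sequence[int],
                     marker_positions: Sequence[int],
                     query_vec: Vector, marker_vec: Vector,
                     early_layers: Set[int], late_layers: Set[int],
                     alpha: float = 1.0) -> List[Vector]:
    """Steer query tokens at early layers and marker tokens at later layers.

    Layers outside both sets are returned unchanged (as a copy).
    """
    out = hidden
    if layer in early_layers:
        out = steer_positions(out, query_positions, query_vec, alpha)
    if layer in late_layers:
        out = steer_positions(out, marker_positions, marker_vec, alpha)
    return [list(row) for row in out]


# Toy setup: 3 tokens with 2-dimensional hidden states, all zeros.
hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
config = dict(
    query_positions=[0],      # hypothetical index of a query token
    marker_positions=[2],     # hypothetical index of a region-marker token
    query_vec=[1.0, 0.0],
    marker_vec=[0.0, 2.0],
    early_layers={0, 1},      # assumed "early" layers
    late_layers={4, 5},       # assumed "middle-to-late" layers
    alpha=0.5,
)

early_out = decomposed_steer(hidden, layer=0, **config)
middle_out = decomposed_steer(hidden, layer=2, **config)
late_out = decomposed_steer(hidden, layer=4, **config)
print(early_out, middle_out, late_out)
```

In practice an edit like this would typically be attached to the model's layers through forward hooks, with the query vector applied while the prompt is processed and the marker vector applied during decoding. Keeping the two edits in separate layer sets mirrors the complementary roles reported in the ablations.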