FuguReport

Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering

Authors Yun Xing, Hanyuan Liu, Jiahao Nie, Shijian Lu
Affiliations Nanyang Technological University
Categories Method / Latent Variable Methods / Training-free contextual steering approach, Application / Visual Grounding / Region referring in multimodal models, Evaluation / Model Evaluation / Performance comparison of contextual steering
License CC BY-SA 4.0

Abstract Overview

This paper addresses multi-region visual referring in large multimodal models (LMMs), where multiple marked regions must be interpreted jointly, sometimes requiring global scene context. The authors propose Contextual Latent Steering (CSteer), a training-free method that pre-computes contextual steering vectors from contrastive examples and applies representation editing at inference time without fine-tuning or architectural modifications. The vectors encode behaviors useful for referring, such as distinguishing among multiple marked regions and incorporating broader contextual cues. Experiments evaluate the method on GAR-Bench, INST-IT, VIP-Bench, and BLINK benchmarks, with ablations on vector construction, layer selection, steering decomposition, and data scale.

Novelty

The key contribution is a training-free method that improves multi-region visual referring in general LMMs without adding a region encoder, fine-tuning, or modifying the architecture. The approach builds steering vectors from incorrect model rollouts paired with referential rewrites corrected by an LLM judge. It then applies decomposed steering in two parts: to query tokens at early layers, and to marker tokens during decoding at middle-to-late layers.
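As an illustration of the vector construction described above, the following minimal sketch computes a mean-difference steering vector from contrastive pairs and adds it to a hidden state. It is not the authors' implementation: the function names, the scaling factor `alpha`, the mean-difference recipe, and the toy three-dimensional "hidden states" are all assumptions made for this example.

```python
# Hypothetical sketch of contrastive steering-vector construction.
# Hidden states are toy lists of floats; a real LMM would supply
# per-layer activations. Names and the alpha parameter are assumptions.

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def steering_vector(positive_states, negative_states):
    """Mean of positive states minus mean of negative states."""
    pos = mean_vector(positive_states)
    neg = mean_vector(negative_states)
    return [p - q for p, q in zip(pos, neg)]

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a single hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy data: states for corrected rewrites (positive) and
# incorrect rollouts (negative).
rewrites = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
rollouts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

v = steering_vector(rewrites, rollouts)
steered = apply_steering([0.5, 0.5, 0.5], v, alpha=0.5)
```

Because the vector is computed once from stored examples, applying it at inference time is a simple addition and requires no gradient updates, which matches the training-free framing of the method.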

Results

CSteer consistently improves over Set-of-Mark prompting on strong general LMM baselines across multiple benchmarks. On Qwen3-VL-8B, it raises INST-IT image open-ended performance from 78.5 to 80.4 and video multiple-choice performance from 58.2 to 60.1, improves GAR-Bench OE from 52.5 to 57.4, increases the VIP-Bench average from 71.5 to 74.7, and raises BLINK from 55.9 to 57.5. Ablations indicate that rewrite-based vector construction and decomposed steering are the most effective design choices.
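To make the size of these improvements easier to compare, the short sketch below computes the absolute gain for each Qwen3-VL-8B result quoted above. The scores are taken from this summary; the dictionary keys and variable names are labels chosen for this example only.

```python
# Baseline and CSteer scores for Qwen3-VL-8B as quoted in this summary.
# The keys are illustrative labels, not official benchmark identifiers.
scores = {
    "INST-IT image open-ended": (78.5, 80.4),
    "INST-IT video multiple-choice": (58.2, 60.1),
    "GAR-Bench OE": (52.5, 57.4),
    "VIP-Bench average": (71.5, 74.7),
    "BLINK": (55.9, 57.5),
}

# Absolute gain in points, rounded to one decimal to avoid float noise.
gains = {name: round(after - before, 1)
         for name, (before, after) in scores.items()}

# Benchmark with the largest absolute gain.
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

Under these figures the gains range from 1.6 to 4.9 points, with the largest on GAR-Bench OE.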

Key Points

  1. CSteer targets a specific weakness of general LMMs: referring to multiple marked regions simultaneously, especially when correct answers depend on contextual scene understanding rather than isolated object recognition.
  2. The method derives contextual steering vectors from contrastive hidden-state differences, with the strongest variant using false rollouts paired with LLM-corrected rewrites to capture referential corrections.
  3. Ablations show that early-layer in-query steering and mid-to-late-layer decoding steering play complementary roles, and that gains are consistent across data scales (32 to 1024 samples), input domains (images and videos), and prompting formats (points, boxes, numerical identifiers).
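The decomposed steering described in the points above can be pictured as a rule that decides, per layer and per token, which kind of steering (if any) applies. The sketch below is only a schematic under stated assumptions: the layer count, the `early_end` and `late_start` boundaries, the phase names, and the token labels are hypothetical and are not taken from the paper.

```python
# Hypothetical schedule for decomposed steering.
# num_layers, early_end, and late_start are assumed values for illustration.

def steering_kind(layer, phase, token_kind,
                  num_layers=32, early_end=8, late_start=16):
    """Return which steering applies at this position, or None.

    layer      -- zero-based layer index
    phase      -- "prefill" (processing the input) or "decode" (generating)
    token_kind -- "query", "marker", or "other"
    """
    # In-query steering: query tokens at early layers.
    if phase == "prefill" and token_kind == "query" and layer < early_end:
        return "in_query"
    # Decoding steering: marker tokens at middle-to-late layers.
    if (phase == "decode" and token_kind == "marker"
            and late_start <= layer < num_layers):
        return "decoding"
    return None

# Layers at which each kind of steering would fire under these assumptions.
in_query_layers = [l for l in range(32)
                   if steering_kind(l, "prefill", "query") == "in_query"]
decoding_layers = [l for l in range(32)
                   if steering_kind(l, "decode", "marker") == "decoding"]
```

Keeping the two rules separate reflects the ablation finding that the early-layer and mid-to-late-layer components play complementary roles; the exact boundaries would need to be chosen per model.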

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.