Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought
- URL: http://arxiv.org/abs/2602.23959v1
- Date: Fri, 27 Feb 2026 12:04:07 GMT
- Title: Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought
- Authors: Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang,
- Abstract summary: We propose a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space. Experiments on three benchmarks demonstrate that NV-CoT significantly improves localization precision and final answer accuracy.
- Score: 55.65577137924979
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions either via textified coordinates, which cause modality mismatch and semantic fragmentation, or via fixed-granularity patches, which both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available at https://github.com/kesenzhao/NV-CoT.
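The abstract names the core mechanism (a Gaussian or Laplace policy over box coordinates, reparameterized sampling, and GRPO-style optimization) but not its implementation. The PyTorch sketch below is a minimal illustration of that general idea only, not the authors' code; names such as BoxPolicyHead, hidden_dim, sample_box, and grpo_loss are hypothetical assumptions.

```python
import torch
import torch.nn as nn

class BoxPolicyHead(nn.Module):
    """Hypothetical continuous-action head: a Gaussian over normalized
    bounding-box coordinates (x1, y1, x2, y2) in [0, 1]."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, 4)      # predicts box means
        self.log_std = nn.Linear(hidden_dim, 4)   # predicts per-coordinate log std

    def forward(self, h: torch.Tensor) -> torch.distributions.Normal:
        mu = torch.sigmoid(self.mean(h))                   # keep means inside the image
        std = torch.exp(self.log_std(h).clamp(-5.0, 0.0))  # bounded positive std
        return torch.distributions.Normal(mu, std)


def sample_box(head: BoxPolicyHead, h: torch.Tensor):
    """Reparameterized sampling: box = mu + std * eps, so gradients flow
    through the sampled coordinates; the log-prob feeds a GRPO-style ratio."""
    dist = head(h)
    box = dist.rsample()                   # differentiable draw
    log_prob = dist.log_prob(box).sum(-1)  # joint log-density over the 4 coordinates
    return box.clamp(0.0, 1.0), log_prob


def grpo_loss(log_prob, old_log_prob, advantage, clip_eps: float = 0.2):
    """Clipped surrogate over group-normalized advantages (GRPO-style)."""
    ratio = torch.exp(log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```

Swapping torch.distributions.Normal for torch.distributions.Laplace would give the Laplace variant mentioned in the abstract without changing the rest of the sketch.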
Related papers
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
arXiv Detail & Related papers (2026-02-02T13:59:39Z) - Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions, which are semantically rich, to complement the visual features. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA).
arXiv Detail & Related papers (2026-02-01T14:35:46Z) - GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing [50.961694646995376]
We propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods.
arXiv Detail & Related papers (2026-01-23T10:12:59Z) - GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models [23.159388800893964]
We argue that alignment is most effective when both modalities share a unified geometric basis. We employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. Our framework achieves a 20% performance improvement over current state-of-the-art methods.
arXiv Detail & Related papers (2026-01-12T15:14:29Z) - CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer [48.52152634356309]
We propose a correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules. Our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.
arXiv Detail & Related papers (2025-12-16T16:31:41Z) - SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment [8.657941729790599]
We introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Experiments on the Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance.
arXiv Detail & Related papers (2025-11-03T09:41:32Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization [19.70803794316208]
Medical Image Grounding (MIG) involves localizing specific regions in medical images based on textual descriptions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations. We propose Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations (see the reward sketch after this list).
arXiv Detail & Related papers (2025-07-01T21:51:42Z) - VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought [51.43082554363725]
We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to decide when additional visual evidence is needed. Experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R$^3$ sets a new state of the art.
arXiv Detail & Related papers (2025-05-22T03:50:13Z)
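The MedGround-R1 entry above mentions a spatial-semantic reward for GRPO but does not define it. The sketch below is only a plausible reading under assumption: a box-IoU spatial term combined with a binary answer-match semantic term, then group-normalized into advantages as GRPO prescribes. The names box_iou, group_advantages, and the weight w are hypothetical, not taken from that paper.

```python
import torch

def box_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between predicted and ground-truth boxes given as (x1, y1, x2, y2)."""
    x1 = torch.max(pred[..., 0], gt[..., 0])
    y1 = torch.max(pred[..., 1], gt[..., 1])
    x2 = torch.min(pred[..., 2], gt[..., 2])
    y2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]).clamp(min=0) * (pred[..., 3] - pred[..., 1]).clamp(min=0)
    area_g = (gt[..., 2] - gt[..., 0]).clamp(min=0) * (gt[..., 3] - gt[..., 1]).clamp(min=0)
    return inter / (area_p + area_g - inter + 1e-6)

def group_advantages(spatial_r: torch.Tensor, semantic_r: torch.Tensor, w: float = 0.5):
    """Combine a spatial (IoU) reward and a semantic (answer-match) reward for one
    sampled group, then normalize within the group as GRPO does."""
    reward = w * spatial_r + (1.0 - w) * semantic_r
    return (reward - reward.mean()) / (reward.std() + 1e-6)
```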