VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
- URL: http://arxiv.org/abs/2509.21100v1
- Date: Thu, 25 Sep 2025 12:46:46 GMT
- Title: VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
- Authors: Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, Yi Wang
- Abstract summary: Visual Test-Time Scaling (VTTS) is a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by focusing on high-confidence spatio-temporal regions, guided by updated textual predictions. Our newly introduced VideoChat-R1.5 model achieves remarkable improvements, with an average increase of over 5%.
- Score: 50.446538409259524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals and are often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced VideoChat-R1.5 model achieves remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
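
To make the iterative-perception idea concrete, here is a minimal Python sketch of a VTTS-style inference loop, written from the abstract alone: the model answers, proposes a high-confidence spatio-temporal focus region, the video is re-cropped to that region, and the loop repeats until confidence clears a threshold or a round budget is spent. All names here (`mllm_infer`, `crop_spatiotemporal`, the `Step`/`Region` records) are hypothetical placeholders, not the paper's actual API.

```python
# Hedged sketch of an iterative-perception loop in the spirit of VTTS.
# Nothing below is the paper's real implementation; it only illustrates
# how "more perceptual compute" maps to more perceive-then-predict rounds.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    """A spatio-temporal region: a frame span plus a normalized bounding box."""
    t_start: int
    t_end: int
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Step:
    answer: str
    confidence: float
    focus: Region

def mllm_infer(frames: List[str], question: str, history: List[Step]) -> Step:
    """Placeholder for one MLLM forward pass. A real system would prompt the
    model with the (cropped) frames, the question, and prior predictions, then
    parse an answer, a confidence, and a proposed focus region from its output."""
    conf = min(1.0, 0.5 + 0.2 * len(history))  # dummy: confidence grows per round
    return Step(answer="stub answer", confidence=conf,
                focus=Region(0, len(frames), (0.25, 0.25, 0.75, 0.75)))

def crop_spatiotemporal(frames: List[str], region: Region) -> List[str]:
    """Placeholder: keep only the frames (and, in a real system, the pixels)
    inside the model's proposed focus region."""
    return frames[region.t_start:region.t_end]

def iterative_perception(frames: List[str], question: str,
                         max_rounds: int = 4, tau: float = 0.9) -> str:
    """Run up to `max_rounds` perceive-then-predict rounds; each round re-crops
    the video to the model's own high-confidence focus region, so spending more
    rounds spends more perceptual compute on the evidence that matters."""
    history: List[Step] = []
    view = frames
    for _ in range(max_rounds):
        step = mllm_infer(view, question, history)
        history.append(step)
        if step.confidence >= tau:  # confident enough: stop scaling
            break
        view = crop_spatiotemporal(frames, step.focus)  # zoom in and retry
    return history[-1].answer

if __name__ == "__main__":
    video = [f"frame_{i}.jpg" for i in range(32)]
    print(iterative_perception(video, "What does the person pick up?"))
```

The key design choice this sketch tries to capture is that the stopping rule is confidence-driven, so easy questions exit after one pass while hard ones consume the full round budget.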
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z) - Vision-aligned Latent Reasoning for Multi-modal Large Language Model [82.26044667101011]
Vision-aligned Latent Reasoning (VaLR) is a framework that dynamically generates vision-aligned latent tokens before each Chain-of-Thought reasoning step. VaLR is trained to preserve visual knowledge during reasoning by aligning the MLLM's intermediate embeddings with those from vision encoders.
arXiv Detail & Related papers (2026-02-04T12:04:02Z) - Unleashing Perception-Time Scaling to Multimodal Reasoning Models [60.578179197783754]
Recent advances in inference-time scaling have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. We propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems.
arXiv Detail & Related papers (2025-10-10T03:17:52Z) - Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification [22.871255950998016]
We introduce a novel framework for inference-time visual tokens scaling that enables MLLMs to perform verifier-guided reasoning over visual content. Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
arXiv Detail & Related papers (2025-06-08T17:38:49Z) - Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward [77.34936657745578]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z) - SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs [74.2538340966038]
We investigate how Multimodal Large Language Models (MLLMs) process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset of attention heads in LLMs actively contribute to visual understanding. We introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores; a toy sketch of this budget allocation appears after this list.
arXiv Detail & Related papers (2025-06-05T17:59:55Z) - VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [42.316341452766075]
This paper aims to enhance video perception with Reinforcement Fine-Tuning (RFT). We develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal tasks without sacrificing chat ability. Our findings underscore the potential of RFT for specialized task enhancement of video MLLMs.
arXiv Detail & Related papers (2025-04-09T15:09:27Z) - Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [48.455597568212944]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure. In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit visual knowledge, progressing from noisy data to high-quality data.
arXiv Detail & Related papers (2024-10-10T17:59:22Z)
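
As referenced in the SparseMM entry above, one plausible reading of "asymmetric computation budgets ... based on their visual scores" is a proportional split of a fixed KV-cache budget across attention heads. The sketch below is that reading only, with hypothetical names (`allocate_kv_budget`, `visual_scores`), not the paper's actual code.

```python
# Toy sketch of SparseMM-style asymmetric KV-cache budgeting (an assumption
# inferred from the summary above): heads with higher visual scores get a
# larger share of a fixed cache budget, with a floor so no head is starved.

from typing import Dict

def allocate_kv_budget(visual_scores: Dict[str, float],
                       total_budget: int, floor: int = 8) -> Dict[str, int]:
    """Split `total_budget` cached-token slots across heads in proportion to
    their visual scores, guaranteeing each head at least `floor` slots."""
    reserved = floor * len(visual_scores)
    assert total_budget >= reserved, "budget too small for the per-head floor"
    pool = total_budget - reserved
    total = sum(visual_scores.values()) or 1.0
    return {h: floor + int(pool * s / total) for h, s in visual_scores.items()}

if __name__ == "__main__":
    scores = {"head_0": 0.9, "head_1": 0.05, "head_2": 0.05}
    print(allocate_kv_budget(scores, total_budget=256))
    # e.g. {'head_0': 216, 'head_1': 19, 'head_2': 19}
```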