VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
- URL: http://arxiv.org/abs/2509.25151v1
- Date: Mon, 29 Sep 2025 17:54:04 GMT
- Title: VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
- Authors: Zhaozhi Wang, Tong Zhang, Mingyue Guo, Yaowei Wang, Qixiang Ye
- Abstract summary: VideoAnchor is a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining. We show consistent performance gains on benchmarks with InternVL2-8B and Qwen2.5VL-72B. Our code will be made public at https://github.com/feufhd/VideoAnchor.
- Score: 69.64660280965971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains -- e.g., 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B -- while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding. Our code will be made publicly available at https://github.com/feufhd/VideoAnchor.
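To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of how subspace affinities could be injected into attention over visual tokens. The self-expressiveness property says each token in a union of subspaces can be written as a combination of other tokens from its own subspace; the paper uses sparse subspace clustering, but for a closed-form illustration this sketch substitutes a ridge-regularized self-expressive coding. All names (`subspace_affinity`, `anchored_attention`, `lambda_reg`, `alpha`) and the exact way the affinity biases attention are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch: build a self-expressive affinity over visual tokens
# and add it as a bias to attention scores, so tokens lying in the same
# subspace (i.e., the same visual structure across frames) reinforce
# each other. Ridge regression stands in for the paper's sparse coding.
import torch
import torch.nn.functional as F

def subspace_affinity(x: torch.Tensor, lambda_reg: float = 0.1) -> torch.Tensor:
    """Closed-form ridge self-expressive coding over token rows of x.

    Solves C = argmin ||X - C X||_F^2 + lambda ||C||_F^2, then zeroes the
    diagonal (a token must not explain itself) and symmetrizes.
    """
    n, _ = x.shape
    gram = x @ x.T                                   # (n, n) token Gram matrix
    c = torch.linalg.solve(gram + lambda_reg * torch.eye(n), gram)
    c = c - torch.diag(torch.diag(c))                # remove self-expression
    affinity = (c.abs() + c.abs().T) / 2             # symmetric affinity matrix
    return affinity / affinity.max().clamp_min(1e-8) # normalize to [0, 1]

def anchored_attention(q, k, v, affinity, alpha: float = 1.0):
    """Scaled dot-product attention with an additive subspace-affinity bias."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5
    scores = scores + alpha * affinity               # anchor to shared structures
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 16 "visual tokens" drawn from two rank-4 subspaces of R^64.
tokens = torch.cat([torch.randn(8, 4) @ torch.randn(4, 64),
                    torch.randn(8, 4) @ torch.randn(4, 64)])
aff = subspace_affinity(tokens)
out = anchored_attention(tokens, tokens, tokens, aff)
print(aff.shape, out.shape)  # torch.Size([16, 16]) torch.Size([16, 64])
```

Because the bias is computed from the visual tokens alone and added at inference time, a module of this shape would be training-free and plug-and-play in the sense the abstract describes; how VideoAnchor actually scales and places the bias is specified in the paper, not here.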
Related papers
- EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence [10.889641815961133]
Spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or to MLLMs with black-box reconstruction modules. We present EagleVision, a framework for progressive spatial cognition through macro perception and micro verification.
arXiv Detail & Related papers (2025-12-17T07:51:36Z) - Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding [43.63398524449102]
Humans perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential, "blink-like" process. We propose Blink, a dynamic visual token resolution framework that emulates this human-inspired process within a single forward pass. Blink balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently.
arXiv Detail & Related papers (2025-12-11T11:27:25Z) - Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens [54.18057944158818]
Chain-of-Visual-Thought (COVT) is a framework that enables Vision-Language Models (VLMs) to reason through continuous visual tokens. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts. During training, the VLM with COVT autoregressively predicts visual tokens to reconstruct dense supervision signals.
arXiv Detail & Related papers (2025-11-24T18:55:19Z) - Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art results across the S4, MS3, and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs [28.998923104606614]
DisCo is a visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks.
arXiv Detail & Related papers (2025-07-14T14:05:19Z) - FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z) - Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models [18.02840698188587]
We propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2. Our image-only alignment fine-tuning exhibits significant improvements in zero-shot object recognition and fine-grained spatial reasoning.
arXiv Detail & Related papers (2025-06-03T07:44:43Z) - QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models. QuoTA strategically allocates frame-level importance scores based on query relevance. We decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Existing methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - Redundancy-aware Transformer for Video Question Answering [71.98116071679065]
We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
arXiv Detail & Related papers (2023-08-07T03:16:24Z) - Rethinking Multi-Modal Alignment in Video Question Answering from
Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)