Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
- URL: http://arxiv.org/abs/2510.27606v1
- Date: Fri, 31 Oct 2025 16:30:08 GMT
- Title: Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
- Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
- Abstract summary: We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
- Score: 93.19037653970622
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
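As a concrete illustration of how such verifiable signals can be constructed without annotation, the hedged sketch below builds two of the five pretext tasks named in the abstract (shuffled patch reordering and regional depth ordering) together with the exact-match check that could serve as an RLVR reward. The function names, the 2x2 grid, and the binary reward are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation) of two Spatial-SSRL-style
# pretext tasks with automatically verifiable answers. Function names, the
# grid size, and the binary exact-match reward are illustrative assumptions.
import random
import numpy as np
from PIL import Image

def make_patch_reordering_task(image_path: str, grid: int = 2):
    """Shuffled patch reordering: split an RGB image into grid x grid
    patches, shuffle them, and keep the permutation as ground truth."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid
    patches = [img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
               for r in range(grid) for c in range(grid)]
    order = list(range(len(patches)))
    random.shuffle(order)
    shuffled = [patches[i] for i in order]
    # Ground truth: for each shuffled slot, the original patch index it
    # came from -- verifiable with no human or LVLM annotation.
    return shuffled, order

def make_depth_ordering_task(depth: np.ndarray, box_a, box_b):
    """Regional depth ordering: given an RGB-D depth map and two boxes
    (x0, y0, x1, y1), the closer region is the verifiable answer."""
    mean_a = depth[box_a[1]:box_a[3], box_a[0]:box_a[2]].mean()
    mean_b = depth[box_b[1]:box_b[3], box_b[0]:box_b[2]].mean()
    return "A" if mean_a < mean_b else "B"

def verifiable_reward(prediction, ground_truth) -> float:
    """Exact-match reward in the spirit of RLVR: 1.0 if the model's parsed
    answer equals the ground truth, otherwise 0.0."""
    return 1.0 if prediction == ground_truth else 0.0
```

Because the answers are derived deterministically from the input image (and depth map), the reward check is a plain comparison rather than a learned judge, which is what makes this form of supervision cheap to scale.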
Related papers
- MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence [50.11889361459544]
Humans are born with vision-based 4D spatial-temporal intelligence. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs).
arXiv Detail & Related papers (2026-02-28T07:23:36Z) - Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding [78.26501371437013]
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition. We find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. We propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.
arXiv Detail & Related papers (2026-02-15T16:40:33Z) - SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception [6.975054201075641]
Contact-rich robotic manipulation requires representations that encode local geometry. Modern visuo-tactile sensors capture both modalities in a single fused image. Most self-supervised learning frameworks compress feature maps into a global vector.
arXiv Detail & Related papers (2025-12-01T17:26:40Z) - The Path Not Taken: RLVR Provably Learns Off the Principals [85.41043469428365]
We show that sparsity is a surface artifact of a model-conditioned optimization bias. We mechanistically explain these dynamics with a Three-Gate Theory. We provide a parameter-level characterization of RLVR's learning dynamics.
arXiv Detail & Related papers (2025-11-11T18:49:45Z) - SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards [37.39035418889281]
We introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards.
arXiv Detail & Related papers (2025-11-10T18:52:47Z) - Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries [23.825984868116716]
We introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning. We leverage this controllable environment to train Vision-Language Models (VLMs) using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%.
arXiv Detail & Related papers (2025-11-01T21:19:41Z) - One RL to See Them All: Visual Triple Unified Reinforcement Learning [92.90120580989839]
We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components, including Sample-Level Datashelf (to unify diverse task inputs) and Verifier-Level Reward (to deliver custom rewards via specialized verifiers). We introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune.
arXiv Detail & Related papers (2025-05-23T17:41:14Z) - Offline RLAIF: Piloting VLM Feedback for RL via SFO [4.391505380846452]
Vision-Language Models (VLMs) are limited in their ability to solve control tasks due to their lack of action-conditioned training data. A key challenge in Reinforcement Learning from AI Feedback is determining how best to integrate VLM-derived signals into the learning process.
arXiv Detail & Related papers (2025-03-02T23:52:46Z) - SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models [15.50826328938879]
We introduce SURDS, a benchmark designed to evaluate the spatial reasoning capabilities of vision language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples. We propose a reinforcement learning-based alignment scheme leveraging spatially grounded reward signals.
arXiv Detail & Related papers (2024-11-20T08:14:01Z) - SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z)