Fugu-MT paper translation (abstract): Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Paper summary: Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.27606v1
Date: Fri, 31 Oct 2025 16:30:08 GMT
Status: Translation complete
Last updated in system: 2025-11-03 17:52:16.165868
Title: Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Title (reference translation): Spatial-SSRL: Promoting Spatial Understanding through Self-Supervised Reinforcement Learning
Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
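Abstract summary: This work introduces Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Training on the proposed tasks substantially improves spatial reasoning while preserving general visual capabilities. The results suggest that simple, intrinsic supervision enables RLVR at scale and offers a practical route to stronger spatial intelligence in LVLMs.
Reference score (independently computed attention metric): 93.19037653970622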
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
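Abstract (reference translation): Spatial understanding is a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. This work introduces Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks require no human or LVLM annotation and provide ground-truth answers that are easy to verify. Training on these tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks covering both images and video, Spatial-SSRL improves average accuracy over the Qwen2.5-VL baselines by 4.63% (3B) and 3.89% (7B). The results suggest that simple, intrinsic supervision enables RLVR at scale and offers a practical route to stronger spatial intelligence in LVLMs.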
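
As an illustration of how a pretext task can yield a ground-truth answer without any annotation, the following is a minimal sketch of shuffled patch reordering, one of the five tasks named in the abstract. It is not the authors' implementation: the abstract does not specify the grid size, the answer format, or the reward, so the 2x2 grid, the permutation-style answer, the exact-match 0/1 reward, and all function names here are assumptions made for illustration. The image is a plain nested list to keep the example self-contained.

```python
import random


def split_into_patches(image, rows, cols):
    """Split a 2D list into rows*cols equal patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    ph, pw = height // rows, width // cols
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
            patches.append(patch)
    return patches


def make_reordering_task(image, rows=2, cols=2, seed=None):
    """Build one hypothetical reordering sample.

    Returns (shuffled, answer), where answer[i] is the index in `shuffled`
    of the patch that belongs at original slot i. The answer comes from the
    shuffle itself, so no human or model labeling is involved.
    """
    rng = random.Random(seed)
    patches = split_into_patches(image, rows, cols)
    order = list(range(len(patches)))
    rng.shuffle(order)  # shuffled[k] holds original patch order[k]
    shuffled = [patches[i] for i in order]
    answer = [order.index(i) for i in range(len(patches))]
    return shuffled, answer


def verifiable_reward(predicted, answer):
    """Assumed reward: 1.0 for an exact match with the ground truth, else 0.0."""
    return 1.0 if list(predicted) == list(answer) else 0.0


# Toy 4x4 "image" with unique pixel values.
image = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled, answer = make_reordering_task(image, rows=2, cols=2, seed=0)

# Applying the ground-truth answer restores the original patch layout.
restored = [shuffled[k] for k in answer]
```

In an RLVR loop, a model would be shown the shuffled patches and asked for the ordering; `verifiable_reward` would then score its parsed output against `answer`. An exact-match reward is only one possible choice, and a partial-credit variant would be equally compatible with this setup.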


Related papers