Fugu-MT paper translation (abstract): Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

Paper overview: Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

arxiv url: http://arxiv.org/abs/2509.26251v1
Date: Tue, 30 Sep 2025 13:41:43 GMT
Status: translation complete
Last updated in system: 2025-10-01 14:45:00.150666
Title: Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA
Title (reference translation): Seeing Space and Motion: Promoting Latent Actions through Spatial and Dynamic Awareness in VLA
Authors: Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, Ruqi Huang
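Abstract summary: Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. The paper proposes Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling. It further proposes SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM.
Reference score (attention score computed independently by the site): 21.362682837521632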
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
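Abstract (reference translation): Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. However, LAMs have two bottlenecks: 1) the commonly adopted end-to-end image encoder has insufficient spatial understanding; 2) LAMs are fragile when the input frames are far apart, which limits temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings and achieve state-of-the-art performance. These results suggest that a strategy combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in improving the robustness and generalizability of embodied intelligence.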

Related papers