Fugu-MT paper translation (abstract): ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Paper overview: ST4VLA: Spatially Guided Training for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2602.10109v1
Date: Tue, 10 Feb 2026 18:59:17 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:41.342429
Title: ST4VLA: Spatially Guided Training for Vision-Language-Action Models
Authors: Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang
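Abstract summary: Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks. This paper introduces ST4VLA, a dual-system Vision-Language-Action framework that aligns action learning with spatial priors. ST4VLA substantially improves over vanilla VLA, from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot.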
Reference score (attention level computed independently by this site): 80.35847468618276
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatially Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data and models are released at https://internrobotics.github.io/internvla-m1.github.io/
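
To make the size of the reported improvement concrete, the short sketch below computes the absolute and relative gains from the four scores quoted in the abstract. The dictionary layout and the `gains` helper are illustrative names, not part of the released code, and the scores are assumed to be percentages, which the abstract does not state explicitly.

```python
# Scores quoted in the abstract for SimplerEnv: (vanilla VLA, ST4VLA).
# Assumption: the values are percentages; the abstract only says "performance".
reported = {
    "Google Robot": (66.1, 84.6),
    "WidowX Robot": (54.7, 73.2),
}


def gains(before, after):
    """Return (absolute gain, relative gain in percent), each rounded to one decimal."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Hypothetical summary table built only from the numbers in the abstract.
summary = {name: gains(*pair) for name, pair in reported.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

Both settings improve by the same absolute margin of 18.5 points, so the relative gain is larger on WidowX Robot, where the vanilla baseline starts lower.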
