Fugu-MT Paper Translation (Abstract): Geometric Action Model for Robot Policy Learning

Paper overview: Geometric Action Model for Robot Policy Learning

arxiv url: http://arxiv.org/abs/2606.17046v2
Date: Mon, 22 Jun 2026 21:44:31 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.21261
Title: Geometric Action Model for Robot Policy Learning
Authors: Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong
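Abstract summary: Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models. This paper proposes the Geometric Action Model (GAM), a language-conditioned manipulation policy.
Reference score (attention score computed independently by this site): 68.6657929619782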
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
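The abstract describes the split-and-route design only at a high level. The following is a minimal sketch of that data flow, not the authors' implementation. It assumes toy stand-ins throughout: every name (`make_block`, `run_blocks`, `future_predictor`, `gam_step`) is hypothetical, the "blocks" are simple arithmetic functions rather than pretrained GFM layers, and the conditioning on language, proprioception, and action history is collapsed into a single scalar.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]                 # toy stand-in for a set of latent tokens
Block = Callable[[Tokens], Tokens]   # toy stand-in for one backbone block


def make_block(scale: float) -> Block:
    """Hypothetical placeholder for a pretrained GFM block (not a real layer)."""
    def block(tokens: Tokens) -> Tokens:
        return [scale * t + 1.0 for t in tokens]
    return block


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply a run of backbone blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def future_predictor(history: Sequence[Tokens], condition: float) -> Tokens:
    """Toy causal predictor: it sees only latents of past and current frames.

    `condition` collapses language, proprioception, and action history into
    one scalar, purely for illustration.
    """
    steps = len(history)
    mean = [sum(column) / steps for column in zip(*history)]
    return [m + condition for m in mean]


def gam_step(
    backbone: Sequence[Block],
    split: int,
    frames: Sequence[Tokens],
    condition: float,
) -> Tuple[Tokens, float]:
    """One forward pass: encode, predict future latents, then decode."""
    # Split the single backbone at an intermediate layer.
    shallow, deep = backbone[:split], backbone[split:]
    # Shallow layers act as the observation encoder for each observed frame.
    history = [run_blocks(shallow, frame) for frame in frames]
    # The predictor inserted at the split forecasts future latent tokens.
    future = future_predictor(history, condition)
    # Predicted tokens are routed through the remaining backbone blocks.
    features = run_blocks(deep, future)
    geometry = list(features)                 # toy geometry head: identity
    action = sum(features) / len(features)    # toy action head: mean pooling
    return geometry, action


backbone = [make_block(2.0), make_block(1.0), make_block(0.5)]
frames = [[0.0, 1.0], [1.0, 2.0]]
geometry, action = gam_step(backbone, split=1, frames=frames, condition=0.5)
```

The point of the sketch is structural: the same list of blocks serves both as encoder (before `split`) and as decoder (after `split`), so the only new component is the predictor placed between them.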
