Fugu-MT Paper Translation (Abstract): GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Paper overview: GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

arxiv url: http://arxiv.org/abs/2606.13394v1
Date: Thu, 11 Jun 2026 14:25:09 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.849115
Title: GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation
Authors: Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li,
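Abstract summary: Whole-body mobile manipulation requires coordinating a mobile base and a manipulator. We propose GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. In experiments on the ManiSkill-HAB simulation benchmark, GeoHAT achieved a 79.3% mean success rate.
Reference score (independently computed attention score): 6.488530751190965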
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Whole-body mobile manipulation requires coordinating a mobile base and a manipulator under shifting viewpoints, which poses challenges in geometric perception and action generation. Current policies rely on either 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base in a single action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Real-world experiments on diverse tasks confirm consistent improvements over all baselines.
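The two perception ideas in the abstract, Fourier encoding of per-pixel 3D coordinates and per-token gated fusion modulated by depth validity, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the octave-spaced frequency bands, the depth-range validity rule, the fixed scalar `gate_logit` (a real model would presumably predict it per token with learned layers), and the matching token widths are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def fourier_encode(xyz: Sequence[float], num_bands: int = 4) -> Vector:
    """Encode one 3D point as sin/cos features at octave-spaced frequencies."""
    features: Vector = []
    for coord in xyz:
        for band in range(num_bands):
            freq = (2.0 ** band) * math.pi
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features  # length = 3 coords * num_bands * 2


def depth_validity(depth: float, max_depth: float = 10.0) -> float:
    """Return 1.0 for a finite depth in (0, max_depth], else 0.0 (hypothetical rule)."""
    return 1.0 if math.isfinite(depth) and 0.0 < depth <= max_depth else 0.0


def gated_fuse(visual: Vector, xyz: Sequence[float], depth: float,
               gate_logit: float, num_bands: int = 4) -> Vector:
    """Add the geometric token to the visual token, scaled by gate * validity."""
    valid = depth_validity(depth)
    if valid == 0.0:
        # Unreliable depth: leave the pretrained visual feature untouched.
        return list(visual)
    geo = fourier_encode(xyz, num_bands)
    if len(geo) != len(visual):
        raise ValueError("token widths must match in this sketch")
    gate = valid / (1.0 + math.exp(-gate_logit))  # sigmoid gate in (0, 1)
    return [v + gate * g for v, g in zip(visual, geo)]


visual_token = [0.5] * 24      # stand-in for a foundation-model feature (made up)
point = (0.25, -0.1, 1.2)      # per-pixel 3D coordinate in metres (made up)
fused_ok = gated_fuse(visual_token, point, depth=1.2, gate_logit=0.0)
fused_bad = gated_fuse(visual_token, point, depth=float("nan"), gate_logit=0.0)
```

With a valid depth reading the geometric features are added as a gated residual. With an invalid reading the visual token passes through unchanged, which is one simple way to read "preserving the semantic prior" under noisy depth.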
