Fugu-MT Paper Translation (Abstract): G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models

Paper overview: G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2606.24472v1
Date: Tue, 23 Jun 2026 12:02:36 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.937212
Title: G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models
Title (reference translation): G$^3$VLA: Geometric Inductive Bias for Vision-Language-Action Models
Authors: Yue Peng, Yongzhe Zhao, Artur Habuda, Khuyen Pham, Yanheng Zhu, Tran Nguyen Le, Fares Abu-Dakka, Li Guo
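Abstract summary: Vision-language-action (VLA) models have made rapid progress in generalist robot manipulation. Their visual tokens are grounded in 2D image coordinates rather than in the calibrated geometry of the robot's cameras. We propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of a pretrained VLA.
Reference score (independently computed attention score): 3.704517635293094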
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language-action (VLA) models have made rapid progress in generalist robot manipulation by harnessing semantic knowledge from pretrained vision-language backbones, but their visual tokens remain grounded in 2D image coordinates rather than the calibrated geometry of the robot's cameras -- a mismatch especially pronounced in multi-camera setups, where views are coupled by known intrinsics and extrinsics yet processed as independent images. We propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of a pretrained VLA without altering its action space or imitation objective, combining intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion. Geometric supervision is provided either from ground-truth point maps when available, or from confidence-gated $π^3$X teacher predictions, requiring no depth sensors or manual annotations. Instantiated on $π_0$, G$^3$VLA yields consistent gains across the LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings, with the largest improvements on spatially and object-sensitive tasks. We further validate on $π_{0.5}$ and GR00T 1.5, with results suggesting that geometric transfer is most effective when geometry-aware tokens have direct access to the action generation pathway. Our project page is at https://sites.google.com/view/g3vla
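Abstract (reference translation): Vision-language-action (VLA) models have made rapid progress in generalist robot manipulation by leveraging semantic knowledge from pretrained vision-language backbones, but their visual tokens are grounded in 2D image coordinates rather than in the calibrated geometry of the robot's cameras. We propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of a pretrained VLA without changing its action space or imitation objective, combining intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion. Geometric supervision is provided either from ground-truth point maps when available, or from confidence-gated $π^3$X teacher predictions, requiring no depth sensors or manual annotations. Instantiated on $π_0$, G$^3$VLA yields consistent gains across the LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings. We further validate on $π_{0.5}$ and GR00T 1.5, with results suggesting that geometric transfer is most effective when geometry-aware tokens have direct access to the action generation pathway. Our project page is at https://sites.google.com/view/g3vla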
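
Illustrative note: the abstract mentions intrinsic-conditioned ray embeddings but does not say how they are computed. The sketch below shows only the standard geometric ingredient such an embedding could start from, namely per-pixel unit ray directions obtained from pinhole intrinsics. The function name `pixel_rays`, the pixel-centre convention (the `+ 0.5` offset), and the absence of lens distortion are all assumptions made for illustration, not details taken from the paper.

```python
import math


def pixel_rays(fx, fy, cx, cy, width, height):
    """Return unit ray directions in the camera frame, indexed as rays[v][u].

    Hypothetical helper: assumes an ideal pinhole camera (no distortion)
    with focal lengths (fx, fy) and principal point (cx, cy) in pixels.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel centre: K^-1 [u + 0.5, v + 0.5, 1]^T.
            x = (u + 0.5 - cx) / fx
            y = (v + 0.5 - cy) / fy
            # Normalise so the direction does not depend on depth scale.
            norm = math.sqrt(x * x + y * y + 1.0)
            row.append((x / norm, y / norm, 1.0 / norm))
        rays.append(row)
    return rays


if __name__ == "__main__":
    # Toy 2x2 image with the principal point at the image centre.
    for row in pixel_rays(2.0, 2.0, 1.0, 1.0, 2, 2):
        print(row)
```

Each direction depends only on the pixel location and the intrinsics, so the same pixel index maps to different rays under different cameras. This is one way a token can carry calibrated information instead of raw 2D image coordinates.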

List of related papers