Fugu-MT Paper Translation (Abstract): Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

Paper abstract: Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

arxiv url: http://arxiv.org/abs/2604.12908v1
Date: Tue, 14 Apr 2026 15:57:16 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.546412
Title: Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Authors: Zijian Song, Qichang Li, Jiawei Zhou, Zhenlong Yuan, Tianshui Chen, Liang Lin, Guangrun Wang
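Abstract summary: We argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. We propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping.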
Reference score (independently computed attention score): 65.05130114320734
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: At its core, robotic manipulation is a problem of vision-to-geometry mapping ($f(v) \rightarrow G$). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. Conventional VLA and video-predictive models rely on backbones pretrained on large-scale 2D image-text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top-tier VLA baselines including $\pi_{0.5}$ and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero-shot generalization to unseen viewpoints in real-world deployments, consistently outperforming $\pi_{0.5}$. These results highlight that operating on native 3D representations, rather than translating through language or 2D video priors, is a highly promising direction for achieving generalizable physical intelligence.

Related papers