Fugu-MT Paper Translation (Abstract): Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Paper summary: Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

arxiv url: http://arxiv.org/abs/2606.02274v2
Date: Sat, 06 Jun 2026 05:27:32 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:04.782255
Title: Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning
Title (reference translation): Dexterity-BEV: 3D World and Actions for General-Purpose Robot Policy Learning
Authors: Huayi Zhou, Wei Gao, Dekun Lu, Ruiji Liu, Zhanqi Zhang, Ziyang Zhang, Jian Chen, Wenlve Zhou, Sheng Xu, Shumin Li, Kangyi Guo, Shichen Xu, Zixin Huang, Yongyi Su, Kui Jia,
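Abstract summary: End-to-end manipulation policies show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models. The paper presents a series of contributions to address these issues.
Reference score (independently computed attention measure): 51.799524981291235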
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show the promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce aligned vertex map and vertex spectrum -- a pixel-wise 3D representation that elevates 2D visual inputs to 3D, using camera calibration and optional depth. This novel input representation marries 3D awareness with the generalization of 2D large VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing per-pixel 3D information of each camera view and robot actions to a shared coordinate. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and innovatively propose to construct BEV images, producing a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate input and output spatial-temporal misalignments, improving the consistency and generalization for real-world manipulation. Pretrained checkpoint, source code and data processing pipeline are available in https://hnuzhy.github.io/projects/Dex-BEV.
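Abstract (reference translation): End-to-end manipulation policies, combined with Vision-Language Models (VLMs) pretrained at web scale, show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) a reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) a lack of spatial 3D alignment between the input and output spaces, as well as across diverse robot embodiments, camera setups, and trajectory datasets. This paper presents a series of contributions to address these issues. First, it introduces the aligned vertex map and vertex spectrum, a pixel-wise 3D representation that lifts 2D visual inputs to 3D using camera calibration and optional depth. This new input representation combines 3D awareness with the generalization of large 2D VLMs. Next, the authors propose aligning the inputs and outputs of manipulation policies by expressing the per-pixel 3D information of each camera view and the robot actions in a shared coordinate frame. Building on this, they designate a canonical Bird's-Eye-View (BEV) alignment frame and propose constructing BEV images, which yields a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, they develop a comprehensive data processing pipeline that performs these alignments, and they introduce a new temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. Together, these contributions mitigate spatial-temporal misalignment between inputs and outputs, improving consistency and generalization in real-world manipulation. The pretrained checkpoint, source code, and data processing pipeline are available at https://hnuzhy.github.io/projects/Dex-BEV.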
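Illustrative note: the abstract describes lifting each pixel to 3D using camera calibration and optional depth, then expressing the result in a shared coordinate frame. The sketch below shows one common way such a per-pixel 3D map could be computed. It is not the authors' implementation. It assumes a pinhole camera model, metric depth along the camera z-axis, and a known camera-to-shared-frame rigid transform; the function name `vertex_map` and all parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def vertex_map(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[List[Point]]:
    """Back-project a depth image into per-pixel 3D points in a shared frame.

    Assumptions (not taken from the paper): depth[v][u] is metric depth along
    the camera z-axis, (fx, fy, cx, cy) are pinhole intrinsics, and
    rotation (3x3) / translation (3) map camera coordinates to the shared
    frame as p_shared = R * p_cam + t.
    """
    out: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, d in enumerate(row):
            # Pinhole back-projection of pixel (u, v) at depth d.
            p_cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Rigid transform from the camera frame into the shared frame.
            p_shared = tuple(
                sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
            out_row.append(p_shared)
        out.append(out_row)
    return out
```

With an identity rotation and a translation of one unit along z, a 2x2 depth image of constant depth 2.0 (fx = fy = 1, cx = cy = 0.5) maps its top-left pixel to (-1.0, -1.0, 3.0). Because every camera view and the robot actions would be expressed in the same frame, points from different cameras become directly comparable, which is the property the abstract relies on for its BEV construction.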

Related papers