Fugu-MT Paper Translation (Abstract): R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

Paper summary: R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

arxiv url: http://arxiv.org/abs/2603.14498v1
Date: Sun, 15 Mar 2026 17:30:49 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.852926
Title: R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation
Title (reference translation): R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation
Authors: Yuhao Zhang, Wanxi Dong, Yue Shi, Yi Liang, Jingnan Gao, Qiaochu Yang, Yaxing Lyu, Zhixuan Liang, Yibin Liu, Congsheng Xu, Xianda Guo, Wei Sui, Yaohui Jin, Xiaokang Yang, Yanyan Xu, Yao Mu
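Abstract summary: This paper proposes Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively.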
Reference score (independently computed attention score): 45.41467771053697
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to exploit temporal correlations, TFPNet explicitly improves task success rates through consistent feature estimation. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
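Abstract (reference translation): Embodied manipulation requires an accurate 3D understanding of objects and their spatial relations in order to plan and execute contact-rich actions. Large-scale 3D vision models provide strong priors, but their computational cost introduces latency that is prohibitive for real-time control. This paper proposes Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. The core innovation of R3DP is an asynchronous fast-slow collaboration module that integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while a lightweight Temporal Feature Prediction Network (TFPNet) predicts features for all intermediate frames. TFPNet uses historical data to exploit temporal correlations and explicitly improves task success rates through consistent feature estimation. In addition, to enable more effective multi-view fusion, the paper introduces a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. R3DP is evaluated against multiple baselines across different visual configurations. It effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP reduces inference time by 44.8% compared with a naive DP+VGGT integration.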
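
The key-frame idea in the abstract (run the expensive model only on sparse key frames and predict features for the frames in between) can be illustrated with a minimal sketch. This is not the authors' implementation: `KeyframeFeatureCache`, `slow_encoder`, `fast_predictor`, and `linear_extrapolation` are hypothetical names, the slow encoder is only a stand-in for a large 3D model such as VGGT, the toy extrapolation stands in for a learned network such as TFPNet, and the loop runs sequentially rather than asynchronously.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class KeyframeFeatureCache:
    """Call an expensive encoder on every `interval`-th frame only and
    fill the frames in between with a cheap predictor.

    Both callables are hypothetical stand-ins: `slow_encoder` for a large
    3D model, `fast_predictor` for a lightweight temporal network.
    """

    def __init__(
        self,
        slow_encoder: Callable[[object], Sequence[float]],
        fast_predictor: Callable[[List[Feature]], Sequence[float]],
        interval: int = 4,
        history_len: int = 2,
    ) -> None:
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.slow_encoder = slow_encoder
        self.fast_predictor = fast_predictor
        self.interval = interval
        # Recent features (key-frame and predicted) fed to the predictor.
        self.history: Deque[Feature] = deque(maxlen=history_len)
        self.slow_calls = 0
        self._step = 0

    def features(self, frame: object) -> Feature:
        if self._step % self.interval == 0:
            # Key frame: pay for the expensive encoder.
            feat = list(self.slow_encoder(frame))
            self.slow_calls += 1
        else:
            # Intermediate frame: predict from history, ignoring the frame.
            feat = list(self.fast_predictor(list(self.history)))
        self.history.append(feat)
        self._step += 1
        return feat


def linear_extrapolation(history: List[Feature]) -> Feature:
    """Toy predictor: continue the trend of the last two features."""
    if len(history) < 2:
        return list(history[-1])
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]
```

With `interval=4`, eight frames trigger only two calls to the slow encoder; the other six features come from the predictor. In a real-time system the slow encoder would typically run in a background worker so that the control loop never blocks on it, which this sequential sketch does not model.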

Related papers