Fugu-MT Paper Translation (Summary): Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera

Paper summary: Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera

arxiv url: http://arxiv.org/abs/2606.14535v1
Date: Fri, 12 Jun 2026 15:12:03 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:42.9538
Title: Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera
Title (reference translation): Spatially Conditioned Diffusion Policy: Learning High-Precision, Robust Manipulation with a Single RGB Camera
Authors: Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, Min Jun Kim
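Abstract summary: Spatially Conditioned Diffusion Policy (SCDP) is a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps capturing both broad context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop.
Reference score (independently computed attention level): 6.648702147742411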
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains challenging, as the policy should capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that reflect task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broader context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.
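Abstract (reference translation): Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains difficult, because the policy must capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we propose Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves high-precision, robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories serve as visual attention anchors that reflect task-relevant regions. Based on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps capturing both broad context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, underscoring the potential of single-camera imitation learning.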
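
The abstract describes sampling point-wise features along an end-effector trajectory from multi-scale feature maps. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. It assumes the trajectory waypoints have already been projected into the image and normalized to [0, 1], and it uses plain bilinear interpolation; the function names `bilinear_sample` and `spatial_condition` are hypothetical.

```python
import numpy as np


def bilinear_sample(feature_map, points_xy):
    """Sample a (C, H, W) feature map at N normalized (x, y) points in [0, 1].

    Returns an (N, C) array of bilinearly interpolated features.
    Assumption: points are already projected to normalized image coordinates.
    """
    _, h, w = feature_map.shape
    # Map normalized coordinates to continuous pixel indices.
    x = np.clip(points_xy[:, 0], 0.0, 1.0) * (w - 1)
    y = np.clip(points_xy[:, 1], 0.0, 1.0) * (h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0  # horizontal interpolation weights, shape (N,)
    wy = y - y0  # vertical interpolation weights, shape (N,)
    # Indexing with a slice plus two index arrays yields shape (C, N).
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T  # (N, C)


def spatial_condition(feature_maps, trajectory_xy):
    """Concatenate point-wise features from every scale into (N, sum of C)."""
    per_scale = [bilinear_sample(f, trajectory_xy) for f in feature_maps]
    return np.concatenate(per_scale, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical scales: a fine map and a coarse map.
    maps = [rng.normal(size=(8, 32, 32)), rng.normal(size=(16, 8, 8))]
    # Eight hypothetical waypoints of an intermediate (noisy) trajectory.
    waypoints = rng.uniform(size=(8, 2))
    print(spatial_condition(maps, waypoints).shape)
```

In a diffusion loop, a function like this would presumably be called at each denoising step with the current intermediate trajectory, so that the conditioning features follow the trajectory as it is refined. How SCDP projects waypoints, which scales it uses, and how the sampled features enter the denoiser are not stated in the abstract.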
