Fugu-MT Paper Translation (Abstract): LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Paper overview: LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

arxiv url: http://arxiv.org/abs/2604.08475v1
Date: Thu, 09 Apr 2026 17:14:00 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:06.041979
Title: LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
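Abstract summary: LAMP lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. The key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation.
Reference score (attention score computed independently by this site): 33.021510263749455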
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.


Related papers