Fugu-MT Paper Translation (Abstract): Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Paper abstract: Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

arxiv url: http://arxiv.org/abs/2605.20085v1
Date: Tue, 19 May 2026 16:39:30 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:09.530622
Title: Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
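Title (reference translation): Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
Authors: Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong
Abstract summary: We present the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This new setting uses initial spatial prompts to define the task objective and requires the model to forecast future end-effector trajectories from egocentric streams. We propose SPOT, which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion.
Reference score (independently computed attention score): 19.295853768161606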
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.
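Abstract (reference translation): Robotic manipulation is often specified by language instructions or task identifiers, but cluttered environments containing similar objects are handled better by spatially indicating what to move and where to place it. To address the vision-centric challenge of object and goal specification, we present what is, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This new setting uses initial spatial prompts (such as bounding boxes or points) to define the task objective, and the model must forecast future end-effector trajectories from egocentric streams. To study the problem, we collect and annotate EgoSPT, a dataset of egocentric, spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is difficult because the task specification is static while the scene configuration evolves over time. To solve it, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish the spatial prompting problem SP-VTP as a simple and scalable task condition for egocentric manipulation.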
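The SP-VTP setting described in the abstract can be illustrated with a minimal Python sketch of its inputs and outputs. This is not the authors' SPOT implementation. The names `SpatialPrompt`, `encode_prompt`, and `predict_trajectory`, the pixel box layout, and the constant-velocity generator are all hypothetical stand-ins chosen only to show the data flow: a static first-frame prompt, an evolving end-effector history, and a predicted future trajectory.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels (assumed layout)
Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class SpatialPrompt:
    """Static task specification given once, on the first frame (hypothetical)."""
    object_box: Box  # what to move
    target_box: Box  # where to place it


def encode_prompt(prompt: SpatialPrompt, width: int, height: int) -> List[float]:
    """Normalize both boxes by the image size and concatenate them into one vector."""
    def norm(box: Box) -> List[float]:
        x0, y0, x1, y1 = box
        return [x0 / width, y0 / height, x1 / width, y1 / height]
    return norm(prompt.object_box) + norm(prompt.target_box)


def predict_trajectory(task_code: List[float],
                       history: List[Point3D],
                       horizon: int) -> List[Point3D]:
    """Placeholder generator: extrapolates the last observed end-effector velocity.

    A learned model would condition on task_code and on visual features.
    This stand-in ignores task_code and only illustrates the interface.
    It expects at least one history point.
    """
    if len(history) < 2:
        return [history[-1]] * horizon
    (px, py, pz), (cx, cy, cz) = history[-2], history[-1]
    vx, vy, vz = cx - px, cy - py, cz - pz
    return [(cx + vx * k, cy + vy * k, cz + vz * k) for k in range(1, horizon + 1)]


prompt = SpatialPrompt(object_box=(64, 48, 128, 96), target_box=(320, 240, 640, 480))
task_code = encode_prompt(prompt, width=640, height=480)
future = predict_trajectory(task_code, [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)], horizon=2)
```

The prompt is encoded once and reused at every step, while `history` grows as the scene changes. This mirrors the static task specification and evolving scene configuration that the abstract names as the core difficulty.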

Related papers