Fugu-MT Paper Translation (Abstract): Towards Generalizable Robotic Manipulation in Dynamic Environments

Paper abstract: Towards Generalizable Robotic Manipulation in Dynamic Environments

arxiv url: http://arxiv.org/abs/2603.15620v1
Date: Mon, 16 Mar 2026 17:59:57 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.731446
Title: Towards Generalizable Robotic Manipulation in Dynamic Environments
Title (reference translation): Towards Generalizable Robotic Manipulation in Dynamic Environments
Authors: Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
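Abstract summary: Vision-Language-Action (VLA) models excel at static manipulation but struggle in dynamic environments with moving targets. This paper introduces DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, and proposes PUMA, a dynamics-aware VLA architecture.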
Reference score (independently computed attention metric): 48.77270850350943
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
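Abstract (reference translation): Vision-Language-Action (VLA) models excel at static manipulation but struggle in dynamic environments with moving targets. This performance gap stems mainly from the scarcity of dynamic manipulation datasets and from mainstream VLAs' reliance on single-frame observations, which limits their spatiotemporal reasoning. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, comprising 35 tasks of hierarchical complexity, more than 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. We further propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow with specialized world queries that implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. PUMA achieves state-of-the-art performance, with a 6.3% absolute improvement in success rate over baselines. We also show that training on dynamic data yields robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.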
