Fugu-MT paper translation (abstract): Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Paper overview: Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

arxiv url: http://arxiv.org/abs/2606.16178v1
Date: Mon, 15 Jun 2026 03:49:23 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:34.06227
Title: Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks
Authors: Rutav Shah, Rajat Kumar Jenamani, Xiaohan Zhang, Lingfeng Sun, Roberto Martín-Martín, Yuke Zhu, Deva Ramanan, Karl Schmeckpeper
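Abstract summary: This paper proposes PRISM, a transformer-based architecture for visuomotor policies that use short-term memory. Gated attention filters retrieved information to suppress irrelevant details, which improves performance. A hierarchical architecture compresses local information into compact tokens, which improves the compute and memory footprint.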
Reference score (independently computed attention score): 69.19366746169906
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many robotic tasks require short-term memory, whether it's retrieving an object that's no longer visible or turning off an appliance after a set period. Yet, most visuomotor policies trained via imitation learning rely only on immediate sensory input without using past experiences to guide decisions. We present PRISM, a transformer-based architecture for visuomotor policies to effectively use short-term memory via two key components: (i) gated attention, which filters retrieved information to suppress irrelevant details, improving performance by reducing the spurious correlations between the history and current action prediction, (ii) a hierarchical architecture that first compresses local information into compact tokens and then integrates them to capture temporally extended dependencies, improving its compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies for up to two minutes. To systematically evaluate memory in visuomotor control, we introduce ReMemBench -- a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory -- designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior works, including recurrent architectures, transformers, and their variants -- achieving an absolute improvement of 5%--12% over the strongest baseline. On the RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11%--15% over its no-memory variant and fine-tuned Vision-Language-Action baselines such as GR00T-N1-3B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks. Additional materials are available at https://shahrutav.github.io/short-term-memory
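The abstract names the two components but gives no implementation details, so the following is only a rough sketch of how they might fit together, not the authors' method. It assumes mean pooling as a stand-in for the learned compression of local information into compact tokens, and a per-dimension sigmoid gate computed from the query as a stand-in for gated attention. The names `compress_history`, `gated_attention`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compress_history(frames, chunk_size):
    """Mean-pool each local chunk of frame features into one compact token.

    Assumption: mean pooling replaces whatever learned compression PRISM uses.
    """
    tokens = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        dim = len(chunk[0])
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens


def gated_attention(query, memory_tokens, gate_weights, gate_bias):
    """Retrieve from memory with softmax attention, then filter with a gate.

    Assumption: the gate is sigmoid(gate_weights @ query + gate_bias), applied
    elementwise to the retrieved vector. Returns (filtered_vector, gate).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in memory_tokens])
    dim = len(memory_tokens[0])
    retrieved = [
        sum(w * tok[d] for w, tok in zip(weights, memory_tokens))
        for d in range(dim)
    ]
    gate = [sigmoid(dot(row, query) + b) for row, b in zip(gate_weights, gate_bias)]
    return [g * r for g, r in zip(gate, retrieved)], gate


# Toy usage: five 2-D frame features compressed into three memory tokens.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
tokens = compress_history(frames, chunk_size=2)
query = [1.0, 0.0]
# Hand-picked gate parameters that keep dimension 0 and suppress dimension 1.
gate_weights = [[0.0, 0.0], [0.0, 0.0]]
gate_bias = [20.0, -20.0]
filtered, gate = gated_attention(query, tokens, gate_weights, gate_bias)
```

In this toy setup the history shrinks from five frames to three tokens before attention runs, which is the source of the compute and memory saving the abstract describes. The gate then scales down the retrieved dimension that the hand-picked parameters mark as irrelevant.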

Related papers