Fugu-MT Paper Translation (Abstract): EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Paper overview: EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

arxiv url: http://arxiv.org/abs/2606.20092v1
Date: Thu, 18 Jun 2026 11:11:37 GMT
Status: Translation complete
System update date: 2026-06-19 18:23:39.812635
Title: EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
Title (reference translation): EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
Authors: Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang
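Abstract summary: EventVLA is an end-to-end framework built on the concept of sparse visual evidence memory. Its Keyframe Evidence Memory (KEM) module predicts future keyframe probabilities directly from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. The authors also propose RoboTwin-MeM, a diagnostic benchmark for evaluating non-Markovian manipulation tasks with interactive visual evidence.
Reference score (independently computed attention score): 68.812675280427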
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
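Abstract (reference translation): Memory remains a key bottleneck in long-horizon robotic manipulation, because Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. Existing memory-augmented methods use historical context, but they either suffer from severe information bottlenecks, incur high latency through decoupled dual systems, or rely on unselective buffers that accumulate large amounts of visual redundancy. To address these limitations, EventVLA is introduced as an end-to-end framework built on the concept of sparse visual evidence memory, with two core components: foundational visual anchors that retain the initial and short-term context, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM predicts future keyframe probabilities directly from the VLA's latent embeddings in order to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism lets the policy dynamically assess the future causal utility of current observations and preserve transient visual evidence before it becomes unobservable. The authors also propose RoboTwin-MeM, a diagnostic benchmark designed specifically to evaluate non-Markovian manipulation tasks with interactive visual evidence. In extensive evaluations across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA improves the average success rate by +40% over state-of-the-art memory-augmented VLAs.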
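The abstract describes the memory only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' implementation. It assumes a per-step keyframe probability is already available (in the paper this comes from the VLA's latent embeddings; here it is just a number passed in). The class name `SparseEvidenceMemory`, the threshold rule, the window size, and the keyframe cap are all hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional


@dataclass
class SparseEvidenceMemory:
    """Toy memory: one initial anchor, a short recent window, sparse keyframes.

    All parameters below are illustrative assumptions, not values from the paper.
    """
    recent_len: int = 2       # hypothetical short-term window size
    max_keyframes: int = 4    # hypothetical cap on stored keyframes
    threshold: float = 0.5    # hypothetical keyframe-probability cutoff
    initial: Optional[Any] = None
    recent: Deque[Any] = field(default_factory=deque)
    keyframes: List[Any] = field(default_factory=list)

    def update(self, frame: Any, keyframe_prob: float) -> None:
        # Anchor 1: keep the very first observation.
        if self.initial is None:
            self.initial = frame
        # Anchor 2: keep a sliding window of the most recent observations.
        self.recent.append(frame)
        while len(self.recent) > self.recent_len:
            self.recent.popleft()
        # Keyframe memory: store the frame only if it is predicted to matter later.
        if keyframe_prob >= self.threshold:
            self.keyframes.append(frame)
            if len(self.keyframes) > self.max_keyframes:
                self.keyframes.pop(0)  # drop the oldest stored keyframe

    def context(self) -> List[Any]:
        # Frames the policy would condition on; a frame may appear twice
        # if it is both a keyframe and inside the recent window.
        return [self.initial, *self.keyframes, *self.recent]


def demo() -> List[str]:
    memory = SparseEvidenceMemory()
    # Made-up probabilities standing in for the output of a learned predictor.
    probs = [0.1, 0.9, 0.2, 0.1, 0.7, 0.3]
    for step, prob in enumerate(probs):
        memory.update(f"frame{step}", prob)
    return memory.context()
```

With the made-up probabilities in `demo()`, only `frame1` and `frame4` cross the threshold, so the returned context is the initial frame, those two keyframes, and the last two frames. Under these assumptions, the stored history stays small even as the episode grows, which is the contrast the abstract draws with unselective buffers.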


Related papers