Fugu-MT Paper Translation (Abstract): MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Paper overview: MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

arxiv url: http://arxiv.org/abs/2605.18652v1
Date: Mon, 18 May 2026 16:57:36 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:50.114642
Title: MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
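Title (reference translation): MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
Authors: Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo
Abstract summary: We introduce MementoGUI, a plug-in agentic memory framework for MLLM-based GUI agents. MementoCore is a learned controller for online memory selection, compression, and retrieval. Experiments show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines.
Reference score (independently computed attention score): 47.19679323562172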
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
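Abstract (reference translation): Recent GUI agents have made considerable progress in visual grounding and action prediction, but they remain brittle on long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards the localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without fine-tuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, and that larger MementoCore backbones further strengthen memory-augmented GUI control.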
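
The abstract names four memory operators (step processing, memory compression, episodic writing, episodic selection) acting on a working memory and an episodic memory. As a rough illustration only, the following Python sketch shows how such an interface could be laid out. It is not the authors' implementation: the class and method names (ToyMemoryController, process_step, compress, write_episode, select_episodes), the capacity parameter, and the ROI box format are all hypothetical, and the learned relevance selection described in the abstract is replaced here by a simple word-overlap heuristic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical ROI format: (x0, y0, x1, y1) pixel box of a screenshot crop.
Box = Tuple[int, int, int, int]


@dataclass
class MemoryEntry:
    step: int
    summary: str                # textual summary of an interface event
    roi: Optional[Box] = None   # ROI-level visual evidence, if kept


@dataclass
class Episode:
    task: str
    entries: List[MemoryEntry]


class ToyMemoryController:
    """Illustrative stand-in for a memory controller (all names are assumptions)."""

    def __init__(self, capacity: int = 4) -> None:
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        self.working: List[MemoryEntry] = []
        self.episodic: List[Episode] = []

    def process_step(self, step: int, summary: str,
                     roi: Optional[Box] = None, relevant: bool = True) -> None:
        # Step processing: store only events flagged as task-relevant.
        if relevant:
            self.working.append(MemoryEntry(step, summary, roi))
        if len(self.working) > self.capacity:
            self.compress()

    def compress(self) -> None:
        # Memory compression: merge the oldest entries into one text-only
        # summary and keep the most recent entries (with their ROIs) intact.
        keep = self.capacity // 2
        old, recent = self.working[:-keep], self.working[-keep:]
        merged = MemoryEntry(old[0].step, " | ".join(e.summary for e in old))
        self.working = [merged] + recent

    def write_episode(self, task: str) -> None:
        # Episodic writing: archive the trajectory and reset working memory.
        self.episodic.append(Episode(task, list(self.working)))
        self.working = []

    def select_episodes(self, task: str, k: int = 1) -> List[Episode]:
        # Episodic selection: word overlap stands in for a learned relevance model.
        query = set(task.lower().split())

        def score(ep: Episode) -> int:
            return len(query & set(ep.task.lower().split()))

        ranked = sorted(self.episodic, key=score, reverse=True)
        return [ep for ep in ranked[:k] if score(ep) > 0]


# Usage: six relevant steps with capacity 4 trigger one compression.
mem = ToyMemoryController(capacity=4)
for i in range(6):
    mem.process_step(i, f"event {i}", roi=(0, 0, 10, 10))
mem.write_episode("book a flight to Tokyo")
hits = mem.select_episodes("book a hotel in Tokyo")
```

In this toy version the backbone GUI agent is never touched: the controller only decides what to keep, merge, archive, and retrieve, which mirrors the plug-in framing in the abstract at a purely structural level.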

Related papers