M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
- URL: http://arxiv.org/abs/2603.00503v1
- Date: Sat, 28 Feb 2026 06:59:51 GMT
- Title: M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval
- Authors: Dawei Yan, Haokui Zhang, Guangda Huzhang, Yang Li, Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Ying Li, Wei Dong, Chunhua Shen
- Abstract summary: M$^2$ is a training-free, memory-augmented framework designed to optimize context efficiency and decision-making. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory) to compress verbose interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory) to guide the agent with actionable guidelines retrieved from an offline insight bank.
- Score: 64.06936170117943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agents based on Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities when facing complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory), which compresses verbose interaction history into concise state updates, with Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations across WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success-rate increase and a 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains of up to 12.5% alongside significantly lower computational overhead.
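The dual-tier mechanism described in the abstract can be sketched in a few lines. The code below is a minimal, hypothetical illustration only: the class and method names (`DualMemory`, `update`, `retrieve`, `build_context`), the keyword-match retrieval, and the truncation-based summarizer are all stand-ins invented here, not the paper's implementation (which would use an LLM summarizer and embedding-based retrieval).

```python
from dataclasses import dataclass


@dataclass
class DualMemory:
    """Sketch of a dual-tier memory for a web agent.

    Internal memory: a rolling summary of the trajectory so far.
    External memory: a static, offline bank of actionable insights.
    """
    insight_bank: dict          # task keyword -> actionable guideline
    summary: str = ""           # compressed interaction history
    max_summary_len: int = 200  # budget for the running summary

    def update(self, step_observation: str) -> None:
        # Dynamic Trajectory Summarization: fold each new step into the
        # running summary instead of appending the full raw observation.
        # (Naive tail truncation stands in for an LLM summarizer here.)
        self.summary = (self.summary + " " + step_observation)[-self.max_summary_len:]

    def retrieve(self, task: str) -> list:
        # Insight Retrieval Augmentation: simple keyword matching stands in
        # for similarity search over the offline insight bank.
        words = set(task.lower().split())
        return [tip for key, tip in self.insight_bank.items() if key in words]

    def build_context(self, task: str) -> str:
        # The agent's prompt context stays small: one summary plus a few tips,
        # rather than the full interaction history.
        tips = "; ".join(self.retrieve(task)) or "none"
        return f"Task: {task}\nHistory summary: {self.summary.strip()}\nInsights: {tips}"


mem = DualMemory(insight_bank={"login": "Check for a cookie banner before the login form."})
mem.update("Opened example.com, saw cookie banner.")
print(mem.build_context("login to example.com"))
```

The point of the sketch is the context-budget argument: the agent's prompt grows with the (bounded) summary and a handful of retrieved guidelines, not with the number of interaction steps, which is how the reported token reduction becomes possible.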
Related papers
- Dual Latent Memory for Visual Multi-agent System [69.29799381195592]
Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration. Increasing agent turns often degrades performance while exponentially inflating token costs. We propose L$^2$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories.
arXiv Detail & Related papers (2026-01-31T02:49:10Z)
- Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents [26.39049374286037]
External memory systems are pivotal for enabling Large Language Model (LLM) agents to maintain persistent knowledge and perform long-horizon decision-making. Existing paradigms typically follow a two-stage process: computationally expensive memory construction followed by naive retrieval-augmented generation. We propose CoM (Chain-of-Memory), a novel framework that advocates for a paradigm shift toward lightweight construction paired with sophisticated utilization.
arXiv Detail & Related papers (2026-01-14T04:42:15Z) - EvoRoute: Experience-Driven Self-Routing LLM Agent Systems [100.64399490164959]
EvoRoute is a self-evolving model routing paradigm that transcends static, pre-defined model assignments.<n> Experiments on challenging agentic benchmarks demonstrate that EvoRoute, when integrated into off-the-shelf agentic systems, not only sustains or enhances system performance but also reduces execution cost by up to $80%$ and latency by over $70%$.
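Experience-driven routing of this kind can be illustrated with a toy sketch. Everything below is hypothetical and not from the EvoRoute paper: the class name `ExperienceRouter`, the cheapest-first scan, and the fixed success-rate threshold are illustrative assumptions standing in for whatever learned routing policy the paper actually uses.

```python
from collections import defaultdict


class ExperienceRouter:
    """Toy experience-driven router: prefer the cheapest model whose
    observed success rate on this kind of task clears a threshold."""

    def __init__(self, models, threshold=0.7):
        # models: list of (name, cost-per-call); scanned cheapest first.
        self.models = sorted(models, key=lambda m: m[1])
        self.threshold = threshold
        # (model, task_tag) -> [successes, trials]
        self.history = defaultdict(lambda: [0, 0])

    def route(self, task_tag):
        for name, _cost in self.models:
            wins, trials = self.history[(name, task_tag)]
            # Optimistic prior for unseen (model, task) pairs: try cheap first.
            rate = wins / trials if trials else 1.0
            if rate >= self.threshold:
                return name
        # No cheap model is reliable enough: fall back to the strongest one.
        return self.models[-1][0]

    def record(self, model, task_tag, success):
        # Feed execution outcomes back so routing self-evolves over time.
        stat = self.history[(model, task_tag)]
        stat[0] += int(success)
        stat[1] += 1


router = ExperienceRouter([("small", 1.0), ("large", 10.0)])
router.record("small", "search", False)
router.record("small", "search", False)
print(router.route("search"))  # small's rate is 0.0 < 0.7, so routes to "large"
```

The cost savings reported for such systems come from the same shape of logic: most calls go to the cheap model, and only task types where it has demonstrably failed get escalated.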
arXiv Detail & Related papers (2026-01-06T04:06:46Z) - Efficient-VLN: A Training-Efficient Vision-Language Navigation Model [24.261272070476934]
Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN)<n>We propose Efficient-VLN, a training-efficient VLN model.<n>Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms.<n>Experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR)
arXiv Detail & Related papers (2025-12-11T05:57:48Z) - Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window [88.85901839023803]
DeepMiner is a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window.<n>We develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks.
arXiv Detail & Related papers (2025-10-09T14:31:39Z) - D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
Combination of experts (MoE) model is a sparse variant of large language models (LLMs)<n>Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices.<n>We propose D$2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z) - Balancing Performance and Efficiency in Zero-shot Robotic Navigation [1.6574413179773757]
We present an optimization study of the Vision-Language Frontier Maps applied to the Object Goal Navigation task in robotics.
Our work evaluates the efficiency and performance of various vision-language models, object detectors, segmentation models, and Visual Question Answering modules.
arXiv Detail & Related papers (2024-06-05T07:31:05Z) - Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning [81.0108753452546]
We propose Dynamic Reversible Dual-Residual Networks, or Dr$2$Net, to finetune a pretrained model with substantially reduced memory consumption.
Dr$2$Net contains two types of residual connections, one maintaining the residual structure in the pretrained models, and the other making the network reversible.
We show that Dr$2$Net can reach comparable performance to conventional finetuning but with significantly less memory usage.
arXiv Detail & Related papers (2024-01-08T18:59:31Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed papers) and is not responsible for any consequences of its use.