See and Remember: A Multimodal Agent for Web Traversal
- URL: http://arxiv.org/abs/2603.02626v1
- Date: Tue, 03 Mar 2026 05:55:05 GMT
- Title: See and Remember: A Multimodal Agent for Web Traversal
- Authors: Xinjun Wang, Shengyao Wang, Aimin Zhou, Hao Hao
- Abstract summary: V-GEMS is a robust multimodal agent architecture for web navigation. The agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. Experiments show V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable, robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at https://github.com/Vaultttttttttttt/V-GEMS.
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z) - VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory [43.2995099083993]
VLA models have shown promising potential in embodied navigation by unifying perception and planning. Most existing VLA models rely on reactive mappings directly from observations to actions. We propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition.
arXiv Detail & Related papers (2026-01-13T15:43:43Z) - OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent [58.07447442040785]
We introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation. Results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales.
arXiv Detail & Related papers (2026-01-12T17:55:51Z) - TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making [90.18833928208333]
Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN) is a new benchmark for long-horizon navigation involving multiple sub-demands with explicit task preferences. For spatial memory, we design MASMap, which combines 3D point cloud accumulation with 2D semantic mapping for accurate and efficient environmental understanding. Our approach outperforms state-of-the-art baselines in both perception accuracy and navigation robustness.
arXiv Detail & Related papers (2025-11-21T13:12:13Z) - WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance [29.57207599604568]
WebCoach is a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory. WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents.
arXiv Detail & Related papers (2025-11-17T05:38:50Z) - MGA: Memory-Driven GUI Agent for Observation-Centric Interaction [30.45490249299358]
We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-10-28T08:19:58Z) - R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z) - Memory Proxy Maps for Visual Navigation [6.1190419149081245]
Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three-tiered agent.
arXiv Detail & Related papers (2024-11-15T02:37:14Z) - Polyline Based Generative Navigable Space Segmentation for Autonomous
Visual Navigation [57.3062528453841]
We propose a representation-learning-based framework to enable robots to learn the navigable space segmentation in an unsupervised manner.
We show that the proposed PSV-Nets can learn the visual navigable space with high accuracy, even without any single label.
arXiv Detail & Related papers (2021-10-29T19:50:48Z) - Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning [126.57680291438128]
We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
arXiv Detail & Related papers (2021-08-06T22:19:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.