Retrieval-augmented GUI Agents with Generative Guidelines
- URL: http://arxiv.org/abs/2509.24183v1
- Date: Mon, 29 Sep 2025 02:04:20 GMT
- Title: Retrieval-augmented GUI Agents with Generative Guidelines
- Authors: Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu
- Abstract summary: We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). It consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
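The SFT-then-RSF recipe in the abstract can be sketched as a sampling-and-filtering loop: the agent samples several candidate trajectories per task, keeps only those its own scorer accepts, and finetunes on the accepted set. This is a minimal illustrative sketch, not the paper's implementation: `ToyAgent`, its `generate`/`score`/`finetune` methods, and the acceptance threshold are all assumed stand-ins.

```python
import random

class ToyAgent:
    """Stub standing in for a VLM GUI agent (illustrative only)."""
    def generate(self, task):
        # Sample a short action trajectory for the task.
        return [random.choice(["click", "type", "scroll"]) for _ in range(3)]
    def score(self, task, traj):
        # Pretend self-assessment: accept trajectories ending in "click".
        return 1.0 if traj[-1] == "click" else 0.0
    def finetune(self, pairs):
        # Standard SFT step on accepted (task, trajectory) pairs.
        self.trained_on = list(pairs)

def rejection_sampling_finetune(model, tasks, n_samples=8, threshold=0.5):
    """Keep only self-scored successful trajectories, then finetune on them."""
    accepted = []
    for task in tasks:
        candidates = [model.generate(task) for _ in range(n_samples)]
        kept = [t for t in candidates if model.score(task, t) >= threshold]
        accepted.extend((task, t) for t in kept)
    model.finetune(accepted)
    return accepted
```

In the real method the scorer would be the model's own self-guidance signal rather than a hand-written rule; the control flow (sample, filter by score, finetune on survivors) is what RSF-style training shares across implementations.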
Related papers
- Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents [52.30603055218294]
Trajectory2Task is a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios. It converts valid tool-call trajectories into user-facing tasks with controlled intent adaptations. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures.
arXiv Detail & Related papers (2026-01-28T00:36:13Z)
- MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux [37.49192877577783]
We present MagicGUI-RMS, a multi-agent reward model system that delivers adaptive trajectory evaluation, corrective feedback, and self-evolving learning capabilities. To support reward learning at scale, we design a structured data construction pipeline that automatically produces balanced and diverse reward datasets. Experiments demonstrate that MagicGUI-RMS yields substantial gains in task accuracy and behavioral robustness.
arXiv Detail & Related papers (2026-01-19T13:50:43Z)
- AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent [21.148033135113927]
We introduce an InstructBLIP-based multimodal architecture that achieves superior performance in GUI automation. We propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
arXiv Detail & Related papers (2025-11-30T11:32:54Z)
- Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to actual execution. We develop a reasoning-bootstrapping-based data collection pipeline to create the GUI-Critic-Train and GUI-Critic-Test datasets. Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z)
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [101.09478572153239]
We propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments.
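The inference-time guidance described in this entry amounts to best-of-N action selection: sample candidate actions from the policy at each step and let a reward model pick the one to execute. The sketch below is a hypothetical stand-in, not the paper's interface; `policy_sample` and `reward` are assumed callables.

```python
def guided_step(policy_sample, reward, state, n_candidates=5):
    """Sample candidate actions for a state, return the reward model's pick.

    policy_sample(state) -> action   (assumed: draws one action from the VLM)
    reward(state, action) -> float   (assumed: process reward model score)
    """
    candidates = [policy_sample(state) for _ in range(n_candidates)]
    # Execute only the highest-reward candidate at this inference step.
    return max(candidates, key=lambda a: reward(state, a))
```

Because the reward model intervenes per step rather than per trajectory, errors can be corrected before they compound, which is what makes this viable in dynamic environments.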
arXiv Detail & Related papers (2025-04-22T17:52:42Z)
- GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [16.72683291432717]
GUI-R1 is the first reinforcement learning framework designed to enhance the capabilities of LVLMs in high-level real-world task scenarios. GUI-R1 achieves superior performance using only 0.02% of the data compared to previous state-of-the-art methods like OS-Atlas.
arXiv Detail & Related papers (2025-04-14T17:45:54Z)
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). It achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.