AUTO-Explorer: Automated Data Collection for GUI Agent
- URL: http://arxiv.org/abs/2511.06417v1
- Date: Sun, 09 Nov 2025 15:13:45 GMT
- Title: AUTO-Explorer: Automated Data Collection for GUI Agent
- Authors: Xiangwu Guo, Difei Gao, Mike Zheng Shou
- Abstract summary: We propose an automated data collection method with minimal annotation costs, named Auto-Explorer. It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments. Using the data gathered, we fine-tune a multimodal large language model (MLLM) and establish a GUI element grounding testing set.
- Score: 58.58097564914626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in GUI agents have significantly expanded their ability to interpret natural language commands to manage software interfaces. However, acquiring GUI data remains a significant challenge. Existing methods often involve designing automated agents that browse URLs from the Common Crawl, using webpage HTML to collect screenshots and corresponding annotations, including the names and bounding boxes of UI elements. However, this method is difficult to apply to desktop software or some newly launched websites not included in the Common Crawl. While we expect the model to possess strong generalization capabilities to handle this, it is still crucial for personalized scenarios that require rapid and perfect adaptation to new software or websites. To address this, we propose an automated data collection method with minimal annotation costs, named Auto-Explorer. It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments, gathering data efficiently. Additionally, to assess the quality of exploration, we have developed the UIXplore benchmark. This benchmark creates environments for explorer agents to discover and save software states. Using the data gathered, we fine-tune a multimodal large language model (MLLM) and establish a GUI element grounding testing set to evaluate the effectiveness of the exploration strategies. Our experiments demonstrate the superior performance of Auto-Explorer, showing that our method can quickly enhance the capabilities of an MLLM in explored software.
Related papers
- AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent [21.148033135113927]
We introduce an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation. We propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
arXiv Detail & Related papers (2025-11-30T11:32:54Z)
- GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z)
- GUI-ReRank: Enhancing GUI Retrieval with Multi-Modal LLM-based Reranking [55.762798168494726]
GUI-ReRank is a novel framework that integrates rapid embedding-based constrained retrieval models with highly effective MLLM-based reranking techniques. We evaluated our approach on an established NL-based GUI retrieval benchmark.
arXiv Detail & Related papers (2025-08-05T10:17:38Z)
- GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent [66.34801160469067]
MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents.
arXiv Detail & Related papers (2025-05-22T16:01:06Z)
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
- Large Language Model-Brained GUI Agents: A Survey [42.82362907348966]
Multimodal models have ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands.
arXiv Detail & Related papers (2024-11-27T12:13:39Z)
- Leveraging Large Vision Language Model For Better Automatic Web GUI Testing [7.480576630392405]
This paper proposes VETL, the first LVLM-driven end-to-end web testing technique.
With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context.
The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element.
arXiv Detail & Related papers (2024-10-16T01:37:58Z)
- AgentStudio: A Toolkit for Building General Virtual Agents [57.02375267926862]
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces. It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos. Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation.
arXiv Detail & Related papers (2024-03-26T17:54:15Z)
- ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [30.693616802332745]
This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks.
We propose an advanced Actor-Critic framework, which incorporates a sophisticated GUI driven by an AI agent and adept at handling lengthy procedural tasks.
arXiv Detail & Related papers (2023-12-20T15:28:38Z)
- You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface.
We propose a chain-of-action technique to help the agent decide what action to execute.
We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.