GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
- URL: http://arxiv.org/abs/2503.17709v1
- Date: Sat, 22 Mar 2025 09:30:37 GMT
- Title: GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
- Authors: Yuchen Sun, Shanhui Zhao, Tao Yu, Hao Wen, Samith Va, Mengwei Xu, Yuanchun Li, Chongyang Zhang,
- Abstract summary: We introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization.<n>To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning.
- Score: 22.814882629516635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.
Related papers
- A Survey on (M)LLM-Based GUI Agents [62.57899977018417]
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction.
Recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms.
This survey identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control.
arXiv Detail & Related papers (2025-03-27T17:58:31Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.<n>In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively)
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - GUI Agents: A Survey [129.94551809688377]
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction.<n>Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
arXiv Detail & Related papers (2024-12-18T04:48:28Z) - Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation [53.1000575179389]
We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism.<n>In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation.<n>Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, can lead to more effective GUI generation.
arXiv Detail & Related papers (2024-12-15T22:17:30Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.<n>Our approach leverages image-based observations, and grounding instructions in natural language to visual elements.<n>To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - Large Language Model-Brained GUI Agents: A Survey [42.82362907348966]
multimodal models have ushered in a new era of GUI automation.<n>They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing.<n>These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands.
arXiv Detail & Related papers (2024-11-27T12:13:39Z) - GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z) - GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding [73.9254861755974]
This paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations.<n>We evaluate the capabilities of current state-of-the-art MLLMs, including Image LLMs and Video LLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.