Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills
- URL: http://arxiv.org/abs/2506.10387v1
- Date: Thu, 12 Jun 2025 06:21:19 GMT
- Title: Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills
- Authors: Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie
- Abstract summary: We propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm.
- Score: 57.740236400672046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
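The abstract describes two technical components: the HMS hierarchy (trajectories abstracted into execution, core, and meta skills) and SA-MCTS, which restricts online tree expansion to actions suggested by offline-acquired skills. The listing contains no code, so the following Python sketch is only an illustrative reading of those two ideas; every class and function name (Skill, HMS, expand_with_skills, etc.) is an assumption, not the authors' API.

```python
# Minimal sketch (not the authors' implementation): a toy Hierarchical
# Multimodal Skills store and a skill-restricted MCTS expansion step in the
# spirit of SA-MCTS, as described in the abstract. All names are hypothetical.
from dataclasses import dataclass, field
from typing import Optional
import math


@dataclass
class Skill:
    name: str
    level: str   # "execution", "core", or "meta"
    steps: list  # abstracted GUI action sequence from offline trajectories


@dataclass
class HMS:
    skills: list = field(default_factory=list)

    def retrieve(self, goal, level, k=3):
        # Placeholder retrieval: keyword overlap stands in for whatever
        # multimodal similarity search the real HMS module performs.
        goal_words = set(goal.lower().split())
        scored = [(len(goal_words & set(s.name.lower().split())), s)
                  for s in self.skills if s.level == level]
        scored.sort(key=lambda t: -t[0])
        return [s for score, s in scored[:k] if score > 0]


@dataclass
class Node:
    state: str
    action: Optional[str] = None
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def ucb(child, parent, c=1.4):
    # Standard UCT score; unvisited children are explored first.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits + 1) / child.visits)


def select(root):
    # Vanilla MCTS selection: descend by UCT until reaching a leaf.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ucb(ch, node))
    return node


def expand_with_skills(node, goal, hms):
    # Core SA-MCTS idea as stated in the abstract: instead of branching over
    # the full GUI action space, expand only over actions proposed by skills
    # retrieved from the offline-built HMS, shrinking the online search space.
    for skill in hms.retrieve(goal, level="execution"):
        for step in skill.steps:
            node.children.append(Node(state=f"{node.state} -> {step}", action=step, parent=node))


if __name__ == "__main__":
    hms = HMS(skills=[
        Skill("open settings menu", "execution", ["tap('Settings')", "scroll_down()"]),
        Skill("send a message", "execution", ["tap('Messages')", "type_text('hi')", "tap('Send')"]),
    ])
    root = Node(state="home_screen")
    leaf = select(root)
    expand_with_skills(leaf, goal="open settings and enable dark mode", hms=hms)
    print([child.action for child in leaf.children])  # actions proposed by the retrieved skill
```

In the actual system, retrieval would presumably use multimodal similarity over the skill hierarchy and expanded actions would be executed and scored in the live GUI environment; the keyword-overlap retrieval and string-concatenated states here are purely placeholders.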
Related papers
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents [50.12414817737912]
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents.
Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments.
VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
arXiv Detail & Related papers (2024-08-12T17:44:17Z)
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks [50.13429055093534]
We propose a Hybrid Multimodal Memory module to address the above challenges.
It transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge.
It also summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning.
On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and an Experience-Driven Reflector.
arXiv Detail & Related papers (2024-08-07T08:16:32Z)
- GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS [4.172988187048097]
GPUDrive is a GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine.
We train reinforcement learning agents on the Open Motion dataset, achieving efficient goal-reaching in minutes and scaling to thousands of scenarios in hours.
arXiv Detail & Related papers (2024-08-02T21:37:46Z)
- GPT-4V(ision) is a Generalist Web Agent, if Grounded [20.940613419944015]
We show that GPT-4V can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites.
We propose SEEACT, a web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web.
arXiv Detail & Related papers (2024-01-03T08:33:09Z)
- Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [79.31134731122462]
We introduce the OpenAct benchmark to evaluate open-domain task-solving capability, built on human expert consultation and GitHub repositories.
We present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains by autonomously integrating specialized tools from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
- See and Think: Embodied Agent in Virtual Environment [12.801720916220823]
Large language models (LLMs) have achieved impressive progress on several open-world tasks.
This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment.
arXiv Detail & Related papers (2023-11-26T06:38:16Z)
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models [38.77967315158286]
We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions).
We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences.
JARVIS-1 is the most general agent in Minecraft to date, capable of completing over 200 different tasks using a control and observation space similar to humans'.
arXiv Detail & Related papers (2023-11-10T11:17:58Z)
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [97.87093169454431]
Ghost in the Minecraft (GITM) is a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory.
We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute.
The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate.
arXiv Detail & Related papers (2023-05-25T17:59:49Z)
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [70.47759528596711]
We introduce MineDojo, a new framework built on the popular Minecraft game.
We propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function.
Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward.
arXiv Detail & Related papers (2022-06-17T15:53:05Z)