MobileDreamer: Generative Sketch World Model for GUI Agent
- URL: http://arxiv.org/abs/2601.04035v1
- Date: Wed, 07 Jan 2026 15:51:44 GMT
- Title: MobileDreamer: Generative Sketch World Model for GUI Agent
- Authors: Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu
- Abstract summary: Mobile GUI agents have shown strong potential in real-world automation and practical applications. MobileDreamer is an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent.
- Score: 17.169413605980015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states through a learning process that transforms digital images into key task-related sketches, and employs a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes the agent's action selection by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.
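The abstract describes the rollout imagination strategy only at a high level. The minimal Python sketch below shows one way such world-model lookahead could drive action selection; the `agent.propose`, `world_model.imagine`, and `scorer.evaluate` interfaces, the single-step rollout, and all names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of world-model-based lookahead action selection.
# `agent`, `world_model`, and `scorer` are hypothetical interfaces used only
# to illustrate the idea; they are not components released by the paper.

from dataclasses import dataclass


@dataclass
class Candidate:
    action: str            # e.g. "click(settings_icon)"
    imagined_sketch: str    # predicted textual sketch of the post-action screen
    score: float            # estimated progress toward the task goal


def lookahead_select(task, screen_sketch, agent, world_model, scorer, k=3):
    """Pick the candidate action whose imagined outcome best advances the task.

    agent.propose(...)      -> k candidate actions for the current screen
    world_model.imagine(...) -> textual sketch of the predicted post-action state
    scorer.evaluate(...)     -> how well the imagined state serves the task
    """
    candidates = []
    for action in agent.propose(task, screen_sketch, k=k):
        imagined = world_model.imagine(screen_sketch, action)  # one-step rollout
        score = scorer.evaluate(task, imagined)
        candidates.append(Candidate(action, imagined, score))
    # Execute the action with the best-scoring imagined outcome.
    return max(candidates, key=lambda c: c.score)
```

In this sketch each candidate is scored from a single imagined step; a deeper rollout would simply repeat `world_model.imagine` on the imagined sketch before scoring.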
Related papers
- DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos [110.98100817695307]
We introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning.
arXiv Detail & Related papers (2026-02-06T18:49:43Z)
- MobileWorldBench: Towards Semantic World Modeling For Mobile Agents [43.504202016224234]
We introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models to function as world models for mobile GUI agents. We release MobileWorld, a large-scale dataset consisting of 1.4M samples, that significantly improves the world modeling capabilities of VLMs. We propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates.
arXiv Detail & Related papers (2025-12-16T02:16:42Z)
- Semantic World Models [11.950909271505923]
Planning with world models offers a powerful paradigm for robotic control. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future.
arXiv Detail & Related papers (2025-10-22T17:53:45Z)
- Latent Action Pretraining Through World Modeling [1.988007188564225]
We propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way. Our framework is designed to be effective for transferring across tasks, environments, and embodiments.
arXiv Detail & Related papers (2025-09-22T21:19:10Z)
- ViMo: A Generative Visual GUI World Model for App Agents [60.27668506731929]
ViMo is a visual world model designed to generate future App observations as images. We propose a novel data representation, the Symbolic Text Representation, to overlay text content with symbolic placeholders. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text.
arXiv Detail & Related papers (2025-04-15T14:03:10Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z)
- iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)