Generative Visual Code Mobile World Models
- URL: http://arxiv.org/abs/2602.01576v1
- Date: Mon, 02 Feb 2026 03:12:16 GMT
- Title: Generative Visual Code Mobile World Models
- Authors: Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin,
- Abstract summary: Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code. We introduce gWorld, the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data.
- Score: 33.86938466546132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs to render text precisely forces them to rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering, while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models that are over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.
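The loop described in the abstract is: current screenshot + action -> VLM emits renderable web code for the next screen -> a renderer turns that code back into pixels. The Python sketch below is a minimal illustration of that idea only; the `vlm.generate(image, text)` interface and the function names (`predict_next_state_html`, `render_html_to_png`) are hypothetical placeholders, not the paper's released implementation, and the rendering step is left as a stub.

```python
# Illustrative sketch of the code-as-world-model loop from the abstract.
# All names here are assumptions for illustration, not the paper's API.

from dataclasses import dataclass


@dataclass
class WorldModelStep:
    screenshot_png: bytes   # current GUI state as pixels
    action: str             # e.g. 'tap(x=540, y=1200)' or 'type("hello")'


def predict_next_state_html(vlm, step: WorldModelStep) -> str:
    """Ask a single VLM to emit the *next* GUI state as renderable web code."""
    prompt = (
        "Given the current mobile screen and the action below, "
        "output the next screen as a self-contained HTML/CSS document.\n"
        f"Action: {step.action}"
    )
    # `vlm.generate` is a stand-in for any VLM inference call that accepts
    # an image plus text and returns text.
    return vlm.generate(image=step.screenshot_png, text=prompt)


def render_html_to_png(html: str) -> bytes:
    """Render the predicted code to pixels (e.g., with a headless browser)."""
    # A real pipeline might use Playwright/Chromium; omitted here as a stub.
    raise NotImplementedError


def simulate(vlm, screenshot_png: bytes, action: str) -> bytes:
    """One world-model step: pixels + action -> predicted next-state pixels."""
    html = predict_next_state_html(vlm, WorldModelStep(screenshot_png, action))
    return render_html_to_png(html)
```

Emitting code rather than raw pixels lets the VLM reuse its linguistic priors for exact on-screen text, while the deterministic renderer produces the high-fidelity visual output.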
Related papers
- Code2World: A GUI World Model via Renderable Code Generation [37.96080847935199]
We propose Code2World, a vision-feedback coder that simulates the next visual state via renderable code generation. Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image.
arXiv Detail & Related papers (2026-02-10T14:56:19Z)
- MobileWorldBench: Towards Semantic World Modeling For Mobile Agents [43.504202016224234]
We introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models to function as world models for mobile GUI agents. We release MobileWorld, a large-scale dataset of 1.4M samples that significantly improves the world modeling capabilities of VLMs. We propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates.
arXiv Detail & Related papers (2025-12-16T02:16:42Z)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z)
- Mobile-Agent-v3: Fundamental Agents for GUI Automation [59.775510710011325]
This paper introduces a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models. We propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld.
arXiv Detail & Related papers (2025-08-21T00:39:12Z)
- MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in online environments. It synthesizes a curriculum of learnable tasks through self-exploration and filtering, and it adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z)
- GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [29.65579758217919]
GUI-R1 is the first reinforcement learning framework designed to enhance the capabilities of LVLMs in high-level real-world task scenarios. GUI-R1 achieves superior performance using only 0.02% of the data compared to previous state-of-the-art methods like OS-Atlas.
arXiv Detail & Related papers (2025-04-14T17:45:54Z)
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks.
We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z)
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data [15.801018643716437]
This paper aims to enhance the GUI understanding and interacting capabilities of large vision-language models (LVLMs) through a data-driven approach.
We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web.
Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work.
arXiv Detail & Related papers (2024-10-25T10:46:17Z)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z)
- Lightweight Vision Transformer with Bidirectional Interaction [59.39874544410419]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information. Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)