MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
- URL: http://arxiv.org/abs/2512.14014v1
- Date: Tue, 16 Dec 2025 02:16:42 GMT
- Title: MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
- Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
- Abstract summary: We introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models to function as world models for mobile GUI agents. We release MobileWorld, a large-scale dataset of 1.4M samples that significantly improves the world modeling capabilities of VLMs. We also propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicted as raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld
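The planning integration described in the abstract can be sketched as a simple loop: propose candidate actions, ask the semantic world model to describe each predicted next state in natural language, and pick the action whose predicted state best matches the goal. The function names and the keyword-overlap scorer below are illustrative assumptions, not the paper's actual implementation; `predict_transition` stands in for a VLM call.

```python
def predict_transition(state: str, action: str) -> str:
    """Stub for a VLM world model: describe the next GUI state in words.

    A real semantic world model would condition on the current state
    description; this stub ignores it for simplicity.
    """
    return f"after {action} the screen shows the result"

def score_state(state_desc: str, goal: str) -> float:
    """Stub goal-progress scorer: crude keyword overlap with the goal."""
    goal_words = set(goal.lower().split())
    state_words = set(state_desc.lower().split())
    return len(goal_words & state_words) / max(len(goal_words), 1)

def plan_step(state: str, goal: str, candidate_actions: list[str]) -> str:
    """Pick the action whose predicted next state best matches the goal."""
    return max(
        candidate_actions,
        key=lambda a: score_state(predict_transition(state, a), goal),
    )

best = plan_step(
    state="home screen with Settings and Camera icons",
    goal="open settings",
    candidate_actions=["tap Camera", "tap Settings", "swipe left"],
)
print(best)  # -> tap Settings
```

With real VLM calls in place of the stubs, the same loop extends naturally to multi-step lookahead by feeding predicted state descriptions back into `predict_transition`.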
Related papers
- WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models [114.95269118652163]
We introduce WorldArena, a unified benchmark designed to evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions, and embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners, integrated with subjective human evaluation. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability.
arXiv Detail & Related papers (2026-02-09T18:09:20Z)
- Generative Visual Code Mobile World Models [33.86938466546132]
Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code. We introduce gWorld, the first open-weight visual mobile GUI WM built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data.
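The renderable-code paradigm above can be illustrated with a toy transition function: the model receives the current GUI as HTML plus an action, and emits the next GUI as HTML that can be rendered directly rather than as pixels. The stub below hard-codes a single toggle transition; in gWorld this prediction would come from a VLM, and the element names here are invented for illustration.

```python
def predict_next_gui_as_code(current_html: str, action: str) -> str:
    """Stub code-generation world model: emit the next GUI state as HTML.

    A real WM would generate this markup with a VLM; this stub
    hard-codes one transition (toggling a hypothetical Wi-Fi switch).
    """
    if action == "tap:wifi_toggle":
        return current_html.replace('class="toggle off"', 'class="toggle on"')
    return current_html  # unknown actions leave the screen unchanged

screen = '<div id="wifi_toggle" class="toggle off">Wi-Fi</div>'
next_screen = predict_next_gui_as_code(screen, "tap:wifi_toggle")
print(next_screen)  # <div id="wifi_toggle" class="toggle on">Wi-Fi</div>
```

Because the predicted state is ordinary web code, it can be rendered in a browser to recover a pixel-level preview, sidestepping direct image generation.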
arXiv Detail & Related papers (2026-02-02T03:12:16Z)
- MobileDreamer: Generative Sketch World Model for GUI Agent [17.169413605980015]
Mobile GUI agents have shown strong potential in real-world automation and practical applications. MobileDreamer is an efficient world-model-based lookahead framework that equips GUI agents with future imagination. It consists of a textual sketch world model and rollout imagination for the GUI agent.
arXiv Detail & Related papers (2026-01-07T15:51:44Z)
- World-in-World: World Models in a Closed-Loop World [123.85805788728128]
We introduce World-in-World, the first open platform that benchmarks world models in a closed-loop world that mirrors real agent-environment interactions. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed
arXiv Detail & Related papers (2025-10-20T22:09:15Z)
- Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate the capabilities of Vision-Language Models when world model priors are transferred into them. We name our best-performing variant Dynamic Vision Aligner (DyVA). We find DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z)
- PoE-World: Compositional World Modeling with Products of Programmatic Experts [50.35012247866856]
Learning how the world works is central to building AI agents that can adapt to complex environments. Recent advances in program synthesis using Large Language Models (LLMs) offer an alternative approach that learns world models represented as source code. We show that this approach can learn complex world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge.
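The product-of-experts idea named in the title can be sketched concretely: each expert is a small program that scores candidate next states, and the combined model multiplies the experts' scores and renormalizes, so any single expert can veto a transition by assigning it zero probability. The corridor example and expert programs below are illustrative assumptions, not PoE-World's actual synthesized experts.

```python
from math import prod

def poe_predict(state, action, experts, candidates):
    """Combine programmatic experts multiplicatively, then renormalize."""
    scores = {s: prod(e(state, action, s) for e in experts) for s in candidates}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

def movement_expert(x, action, nx):
    # Prefers moving one cell in the action's direction.
    target = x + 1 if action == "right" else x - 1
    return 0.8 if nx == target else 0.1

def bounds_expert(x, action, nx):
    # Vetoes any position outside the 5-cell corridor [0, 4].
    return 1.0 if 0 <= nx <= 4 else 0.0

# At the right wall, the bounds expert zeroes out the move to x=5,
# so the remaining mass splits between staying put and stepping back.
dist = poe_predict(4, "right", [movement_expert, bounds_expert], [3, 4, 5])
print(dist)  # {3: 0.5, 4: 0.5, 5: 0.0}
```

The multiplicative combination is what makes the model compositional: adding a new expert program only restricts the joint distribution further, without retraining the others.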
arXiv Detail & Related papers (2025-05-16T03:28:42Z)
- ViMo: A Generative Visual GUI World Model for App Agents [60.27668506731929]
ViMo is a visual world model designed to generate future App observations as images. We propose a novel data representation, the Symbolic Text Representation, to overlay text content with symbolic placeholders. With this design, ViMo employs a STR Predictor to predict future GUIs' graphics and a GUI-text Predictor for generating the corresponding text.
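The Symbolic Text Representation can be illustrated by swapping each visible text run in a GUI markup snippet for a placeholder, so a graphics predictor only has to reproduce layout while a separate text predictor fills the placeholders back in. The regex and placeholder format below are illustrative guesses at the idea, not ViMo's actual representation.

```python
import re

def to_symbolic(gui_markup: str) -> tuple[str, dict[str, str]]:
    """Replace visible text runs with symbolic placeholders.

    Returns layout-only markup plus the mapping needed to restore text.
    """
    mapping: dict[str, str] = {}

    def substitute(match: re.Match) -> str:
        key = f"[T{len(mapping)}]"
        mapping[key] = match.group(1)
        return f">{key}<"

    # Text runs are the non-tag spans between '>' and '<'.
    return re.sub(r">([^<>]+)<", substitute, gui_markup), mapping

layout, texts = to_symbolic("<button>Send</button><p>Hello, Alice</p>")
print(layout)  # <button>[T0]</button><p>[T1]</p>
print(texts)   # {'[T0]': 'Send', '[T1]': 'Hello, Alice'}
```

Splitting the task this way means the image generator never has to render legible glyphs, which is exactly the failure mode that makes pixel-space GUI prediction hard.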
arXiv Detail & Related papers (2025-04-15T14:03:10Z)
- TrajLLM: A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation [3.8106509573548286]
This work leverages Large Language Models (LLMs) to simulate human mobility, addressing challenges like high costs and privacy concerns in traditional models. Our hierarchical framework integrates persona generation, activity selection, and destination prediction, using real-world demographic and psychological data.
arXiv Detail & Related papers (2025-02-26T00:13:26Z)
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction [28.53259866617677]
We introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment.
We collect an open-world task set across various real-world apps and a fixed-world set, WikiHow, which captures a significant amount of dynamic online content.
Our findings reveal that even advanced models struggle with tasks that are relatively simple for humans.
arXiv Detail & Related papers (2023-05-14T12:31:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.