MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment
- URL: http://arxiv.org/abs/2601.20335v2
- Date: Thu, 29 Jan 2026 02:32:46 GMT
- Title: MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment
- Authors: Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan,
- Abstract summary: MobileBench-OL is an online benchmark with 1080 tasks from 80 Chinese apps. It measures the task execution, complex reasoning, and noise robustness of agents. Evaluation on MobileBench-OL shows that leading GUI agents have significant room for improvement before meeting real-world requirements.
- Score: 17.207878975582556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures the task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.
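The abstract's auto-eval framework relies on a reset mechanism that restores the device to a known state before each task, which is what makes real-world runs repeatable. The sketch below only illustrates that general pattern; `MobileEnv`-style names such as `reset_to_snapshot`, `observe`, `step`, and `check_success` are assumed for illustration and are not the paper's actual API.

```python
# Minimal sketch of an auto-eval loop with a per-task reset, assuming a
# hypothetical mobile-environment interface (reset_to_snapshot, observe,
# step, check_success). These names are illustrative, not the paper's API.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    success: bool
    steps: int


def evaluate(agent, env, tasks, max_steps=30):
    """Run every task from a clean device state and record the outcome."""
    results = []
    for task in tasks:
        env.reset_to_snapshot(task.app)   # reset: restore app/device state
        obs = env.observe()               # e.g. screenshot + UI hierarchy
        steps_taken = 0
        for steps_taken in range(1, max_steps + 1):
            action = agent.act(task.instruction, obs)
            obs, done = env.step(action)
            if done:
                break
        results.append(TaskResult(task.task_id, env.check_success(task), steps_taken))
    return results
```

Because every episode starts from the same snapshot, re-running the same task list through a loop like this yields stable, comparable numbers across agents.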
Related papers
- GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation.
GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality.
Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z)
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by real-world problems.
We show that frontier models and agents score less than 65% on the benchmark.
We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
- MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments [19.665566262516275]
AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation.
MobileWorld is a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks.
arXiv Detail & Related papers (2025-12-22T14:31:28Z)
- Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents [6.501527187326423]
MobiBench is a modular and multi-path offline benchmarking framework for mobile GUI agents.
It enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings.
Our experiments demonstrate that MobiBench GUI achieves 94.72 percent agreement with human evaluators.
arXiv Detail & Related papers (2025-12-14T10:41:39Z)
- Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation [5.332969177132911]
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues.
We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries.
arXiv Detail & Related papers (2025-10-10T04:42:02Z)
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.
Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.
We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.
We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.
We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- AgentStudio: A Toolkit for Building General Virtual Agents [57.02375267926862]
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments.
AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces.
It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos.
Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation.
arXiv Detail & Related papers (2024-03-26T17:54:15Z)
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction [28.53259866617677]
We introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment.
We collect an open-world task set across various real-world apps and a fixed world set, WikiHow, which captures a significant amount of dynamic online content.
Our findings reveal that even advanced models struggle with tasks that are relatively simple for humans.
arXiv Detail & Related papers (2023-05-14T12:31:03Z)