AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
- URL: http://arxiv.org/abs/2512.21302v1
- Date: Wed, 24 Dec 2025 17:40:42 GMT
- Title: AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
- Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng,
- Abstract summary: We introduce AndroidLens, a challenging evaluation framework for mobile GUI agents. It comprises 571 long-latency tasks in both Chinese and English environments. Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP.
- Score: 36.66219528445988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
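The abstract reports two headline metrics: task success rate and the milestone-based Average Task Progress (ATP). The exact ATP formula is not given in the abstract; below is a minimal, hypothetical sketch assuming per-task progress is the fraction of milestones the agent reaches and ATP is that fraction averaged over all tasks. The `TaskResult` structure and function names are illustrative, not from the paper.

```python
# Hypothetical sketch of a milestone-based ATP metric (assumption: ATP = mean over
# tasks of reached_milestones / total_milestones; not confirmed by the abstract).
from dataclasses import dataclass


@dataclass
class TaskResult:
    milestones_total: int    # milestones defined for this task
    milestones_reached: int  # milestones the agent actually completed


def task_progress(r: TaskResult) -> float:
    """Per-task progress in [0, 1]: reached milestones / total milestones."""
    return r.milestones_reached / r.milestones_total if r.milestones_total else 0.0


def average_task_progress(results: list[TaskResult]) -> float:
    """ATP: mean per-task progress across the benchmark, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(task_progress(r) for r in results) / len(results)


def success_rate(results: list[TaskResult]) -> float:
    """Task success rate: share of tasks where every milestone was reached."""
    if not results:
        return 0.0
    done = sum(r.milestones_reached == r.milestones_total for r in results)
    return 100.0 * done / len(results)


if __name__ == "__main__":
    # One fully solved task and one half-finished task:
    # success rate = 50.0%, ATP = 75.0%.
    demo = [TaskResult(4, 4), TaskResult(4, 2)]
    print(success_rate(demo), average_task_progress(demo))
```

Under this reading, ATP credits partial progress on long tasks, which is why it can sit well above the strict success rate (50.47% vs. 12.7% for the best models).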
Related papers
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by real-world problems. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z) - MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments [19.665566262516275]
We introduce MobileWorld, a benchmark designed to better reflect real-world mobile usage. MobileWorld comprises 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively.
arXiv Detail & Related papers (2025-12-22T14:31:28Z) - Step-GUI Technical Report [84.83795946544292]
We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System. We also introduce Step-GUI, a family of models that achieves state-of-the-art GUI performance. To assess whether agents can handle authentic everyday usage, we introduce AndroidDaily.
arXiv Detail & Related papers (2025-12-17T13:26:30Z) - GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents [59.107657859025586]
GUI-360$^\circ$ is a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space.
arXiv Detail & Related papers (2025-11-06T12:19:02Z) - The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution [86.4588675093384]
Toolathlon is a benchmark for language agents offering diverse apps and tools, realistic environment setup, and reliable execution-based evaluation. The benchmark includes 108 manually sourced or crafted tasks, each requiring interaction with multiple apps over around 20 turns on average to complete. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
arXiv Detail & Related papers (2025-10-29T17:32:49Z) - ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks [37.79008306764891]
Real-world tasks are often complex and allow for multiple valid solutions. Offline benchmarks can only validate a single predefined "golden path", while online dynamic testing is constrained by the complexity and non-reproducibility of real devices. This paper introduces a novel graph-structured benchmarking framework.
arXiv Detail & Related papers (2025-10-16T12:30:05Z) - MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents [88.35544552383581]
We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents.
arXiv Detail & Related papers (2025-07-25T17:59:26Z) - What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z) - WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point [17.165899818213475]
We introduce WorldGUI, a comprehensive GUI benchmark containing tasks across ten widely used desktop and web applications. WorldGUI-Agent is a universal framework that unifies three core modules: Planner-Critic for high-level plan refinement, Step-Check for intermediate verification, and Actor-Critic for action-level optimization.
arXiv Detail & Related papers (2025-02-12T01:06:10Z) - Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction [28.53259866617677]
We introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment.
We collect an open-world task set across various real-world apps and a fixed-world set, WikiHow, which captures a significant amount of dynamic online content.
Our findings reveal that even advanced models struggle with tasks that are relatively simple for humans.
arXiv Detail & Related papers (2023-05-14T12:31:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.