ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
- URL: http://arxiv.org/abs/2602.21858v3
- Date: Sat, 28 Feb 2026 08:17:24 GMT
- Title: ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
- Authors: Dezhi Kong, Zhengzhao Feng, Qiliang Liang, Hao Wang, Haofei Sun, Changpeng Yang, Yang Li, Peng Zhou, Shuai Nie, Hongzhen Wang, Linfeng Zhou, Hao Jia, Jiaming Xu, Runyu Shi, Ying Huang,
- Abstract summary: This paper introduces ProactiveMobile, a benchmark for proactive mobile agent development. It formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals. In experiments, a fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%).
- Score: 17.39388308538324
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.
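To make the task formulation concrete, the following is a minimal, hypothetical sketch in Python of what a ProactiveMobile-style instance and its success check could look like. The field names, the exact-match scoring rule, and the example API names (get_traffic_info, send_departure_reminder) are assumptions for illustration, not the paper's actual schema or function pool.

```python
# Hypothetical sketch: field names, the exact-match rule, and the example
# API names are illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProactiveInstance:
    context: Dict[str, str]             # on-device contextual signals (assumed four dimensions)
    reference_answers: List[List[str]]  # multi-answer annotations: each is one valid function sequence


def is_success(predicted: List[str], instance: ProactiveInstance) -> bool:
    # Count the prediction as correct if it matches any annotated sequence.
    return any(predicted == ref for ref in instance.reference_answers)


def success_rate(predictions: List[List[str]], instances: List[ProactiveInstance]) -> float:
    hits = sum(is_success(p, inst) for p, inst in zip(predictions, instances))
    return hits / len(instances)


# Example: the agent infers a "morning commute" intent from contextual signals
# and proposes a function sequence drawn from the API pool.
inst = ProactiveInstance(
    context={"time": "08:05", "location": "home",
             "calendar": "meeting at 09:00", "recent_app": "maps"},
    reference_answers=[
        ["get_traffic_info", "send_departure_reminder"],
        ["send_departure_reminder"],
    ],
)
print(success_rate([["send_departure_reminder"]], [inst]))  # 1.0
```

The multi-answer annotations matter here: several distinct function sequences can satisfy the same latent intent, so a prediction is scored against the full annotated set rather than a single gold answer.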
Related papers
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild [30.138230316314534]
We introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction to bidirectional intent alignment. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. We also develop MUSE, an automated framework utilizing an MLLM-as-a-judge multi-agent architecture.
arXiv Detail & Related papers (2026-02-12T09:25:15Z)
- Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration [72.84714132070404]
We propose a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents, including a Searcher that retrieves images from open-world repositories based on the model's capability frontier. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (a 5.7% improvement) and 59.77 on general understanding (a 3.9% improvement).
arXiv Detail & Related papers (2026-02-11T17:29:17Z)
- DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks [10.977990951788422]
DrawingBench is a verification framework for evaluating the trustworthiness of agentic LLMs. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels. We evaluate four state-of-the-art LLMs across 1,000 tests.
arXiv Detail & Related papers (2025-12-01T01:18:21Z)
- Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents [3.0745879700441385]
PROBE decomposes proactivity into a pipeline of three core capabilities. We find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1.
arXiv Detail & Related papers (2025-10-22T17:00:45Z)
- BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities [61.173773299032746]
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. We introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.
arXiv Detail & Related papers (2025-10-09T19:18:36Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.
arXiv Detail & Related papers (2025-02-13T18:11:34Z)
- Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [95.03771007780976]
We tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. First, we collect real-world human activities to generate proactive task predictions. These predictions are labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment.
arXiv Detail & Related papers (2024-10-16T08:24:09Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit (a minimal sketch of such a metric appears after this list). This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
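Several of these benchmarks go beyond a binary success flag. As a complement to the exact-match check sketched above, here is a minimal, hypothetical illustration of a fine-grained progress rate in the spirit of AgentBoard's metric: the agent earns credit for the fraction of annotated subgoals it satisfies rather than all-or-nothing success. The subgoal representation and matching rule are assumptions for illustration, not AgentBoard's actual implementation.

```python
# Hypothetical sketch of a fine-grained progress rate: the subgoal
# representation and matching rule are illustrative assumptions.
from typing import Dict, List, Set


def progress_rate(achieved: Set[str], subgoals: List[str]) -> float:
    """Fraction of annotated subgoals the agent has satisfied (0.0 to 1.0)."""
    if not subgoals:
        return 0.0
    return sum(g in achieved for g in subgoals) / len(subgoals)


def average_progress(episodes: List[Dict]) -> float:
    """Mean progress rate over a set of evaluation episodes."""
    rates = [progress_rate(ep["achieved"], ep["subgoals"]) for ep in episodes]
    return sum(rates) / len(rates)


# Example: a multi-step task annotated with three subgoals; the agent
# completes two of them, so it earns partial credit instead of a flat failure.
episodes = [
    {"subgoals": ["open_settings", "enable_wifi", "connect_network"],
     "achieved": {"open_settings", "enable_wifi"}},
]
print(average_progress(episodes))  # ~0.667
```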
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.