Do Phone-Use Agents Respect Your Privacy?
Abstract Overview
This paper investigates whether phone-use agents handle user data appropriately while completing benign mobile tasks. The authors introduce MyPhoneBench, a verifiable evaluation framework that operationalizes privacy-respecting behavior through the iMy privacy contract, instrumented mock Android apps, and rule-based auditing of agent data handling at the level of individual form entries. The framework defines privacy compliance as permissioned access, minimal disclosure, and user-controlled memory, and tests agents using three privacy probes: over-permissioning, trap resistance, and form minimization. Across five frontier models, 10 apps, and 300 tasks, the study demonstrates that task success does not reliably indicate privacy-compliant behavior, and no single model dominates all evaluation axes.
Novelty
The key contribution is a benchmark that makes privacy behavior in phone-use agents an auditable, reproducible evaluation problem. It combines an explicit execution-time privacy contract (iMy) with controlled apps that log form-level actions and three structured privacy probes (over-permissioning, trap resistance, form minimization). Together these enable deterministic checks of privacy violations during realistic mobile workflows, a capability absent from existing phone-use agent benchmarks.
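To make the idea of a deterministic, form-level check concrete, the following is a minimal sketch of what a form-minimization audit over logged entries might look like. It is not the MyPhoneBench implementation: the `FormEntry` record, the field names, and the scoring rule (the fraction of unneeded optional fields left blank) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormEntry:
    """One logged write to a form field (hypothetical log schema)."""
    field_name: str   # name of the form field the agent wrote to
    value: str        # value the agent entered ("" means left blank)
    optional: bool    # whether the mock app marks the field as optional


def audit_form_minimization(entries, task_required_fields):
    """Return optional fields the agent filled although the task did not need them."""
    return [
        e.field_name
        for e in entries
        if e.optional and e.value.strip() and e.field_name not in task_required_fields
    ]


def minimization_score(entries, task_required_fields):
    """Fraction of unneeded optional fields left blank (1.0 means fully minimal)."""
    unneeded = [
        e for e in entries
        if e.optional and e.field_name not in task_required_fields
    ]
    if not unneeded:
        return 1.0
    blank = sum(1 for e in unneeded if not e.value.strip())
    return blank / len(unneeded)


# Toy log for an illustrative appointment-booking task (all values invented).
log = [
    FormEntry("full_name", "Alex Doe", optional=False),
    FormEntry("appointment_date", "2025-01-15", optional=False),
    FormEntry("middle_name", "Q", optional=True),
    FormEntry("emergency_contact", "", optional=True),
]
required = {"full_name", "appointment_date"}

violations = audit_form_minimization(log, required)
score = minimization_score(log, required)
print(violations, score)
```

Because the rule depends only on the logged entries and a fixed set of task-required fields, the same log always produces the same verdict, which is the property that makes this kind of audit reproducible.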
Results
Experiments across five frontier models show that task success, privacy-qualified success, and later-session use of saved preferences are distinct capabilities, and the model ranking differs on each axis. For example, Claude Opus 4.6 leads task success at 82.8%, Kimi K2.5 leads average privacy at 77.3%, and Qwen 3.5 Plus leads privacy-qualified success at 47.6%. The most persistent privacy failure across all models is form minimization: agents consistently fill optional personal fields that the task does not require, with scores as low as 41% on identity-dense apps like mDMV.
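The gap between task success and privacy-qualified success can be illustrated with a small sketch. It assumes, as a reading of the term rather than the paper's stated definition, that a task counts as a privacy-qualified success only if it both succeeds and passes its privacy checks. The agent names and results below are invented toy data, not figures from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    succeeded: bool    # did the agent complete the task?
    privacy_ok: bool   # did the run pass all privacy checks? (assumed criterion)


def task_success_rate(results):
    """Share of tasks completed, ignoring privacy behavior."""
    return sum(r.succeeded for r in results) / len(results)


def privacy_qualified_rate(results):
    """Share of tasks that were completed AND passed the privacy checks."""
    return sum(r.succeeded and r.privacy_ok for r in results) / len(results)


# Invented toy runs: agent_a finishes everything but over-discloses;
# agent_b finishes less often but never violates the privacy checks.
runs = {
    "agent_a": [
        TaskResult(True, False),
        TaskResult(True, False),
        TaskResult(True, True),
        TaskResult(True, False),
    ],
    "agent_b": [
        TaskResult(True, True),
        TaskResult(True, True),
        TaskResult(False, True),
        TaskResult(False, True),
    ],
}

by_success = sorted(runs, key=lambda m: task_success_rate(runs[m]), reverse=True)
by_qualified = sorted(runs, key=lambda m: privacy_qualified_rate(runs[m]), reverse=True)
print(by_success, by_qualified)
```

In this toy setup the agent that ranks first on success alone ranks last once privacy is required, which mirrors the kind of reordering the results describe.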
Key Points
- MyPhoneBench evaluates mobile agents using an explicit privacy contract (iMy), instrumented mock apps, and deterministic auditing across three privacy probes: over-permissioning, trap resistance, and form minimization.
- Across five frontier models on 300 tasks, no single model dominates all three evaluation axes (task success, privacy-qualified success, and later-session preference use), and jointly evaluating success and privacy reshuffles model rankings relative to success-only evaluation.
- The most persistent privacy failure is unnecessary completion of optional personal information fields (form minimization), consistent with completion-oriented bias rather than access-control confusion.
References
- arXiv: https://arxiv.org/abs/2604.00986v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.00986v1
- Hugging Face Papers: https://huggingface.co/papers/2604.00986
- GitHub: https://github.com/tangzhy/MyPhoneBench