From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents
- URL: http://arxiv.org/abs/2512.13438v1
- Date: Mon, 15 Dec 2025 15:34:06 GMT
- Title: From User Interface to Agent Interface: Efficiency Optimization of UI Representations for LLM Agents
- Authors: Dezhi Ran, Zhi Gong, Yuzhe Guo, Mengzhou Wu, Yuan Cao, Haochuan Lu, Hengyu Zhang, Xia Zeng, Gang Cao, Liangchao Yao, Yuetang Deng, Wei Yang, Tao Xie
- Abstract summary: While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. We present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization.
- Score: 21.811753076804944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which traditional program synthesis uses to decisively validate semantic correctness, poses a fundamental challenge to the co-optimization of token efficiency and completeness. Second, the synthesizer must process large, complex UI trees as input while generating long, compositional transformation programs, which makes the search space vast and error-prone. To address these challenges, we present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization with structured decomposition of the complex synthesis task. First, UIFormer restricts the program space using a domain-specific language (DSL) that captures UI-specific operations. Second, UIFormer conducts LLM-based iterative refinement with correctness and efficiency rewards, providing guidance for achieving the efficiency-completeness co-optimization. UIFormer operates as a lightweight plugin that applies transformation programs for seamless integration with existing LLM agents, requiring minimal modifications to their core logic. Evaluations across three UI navigation benchmarks spanning Android and Web platforms with five LLMs demonstrate that UIFormer achieves 48.7% to 55.8% token reduction with minimal runtime overhead while maintaining or improving agent performance. Real-world industry deployment at WeChat further validates the practical impact of UIFormer.
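The abstract does not reproduce UIFormer's DSL, but a minimal Python sketch can illustrate the kind of transformation program it synthesizes: prune nodes the agent cannot act on, collapse purely structural wrappers, and serialize the result compactly. All node attributes and operator names below are hypothetical.

```python
# Illustrative sketch only: the real UIFormer DSL and its operators are not
# published in this abstract, so all attribute and function names are hypothetical.

def prune_invisible(node: dict) -> dict | None:
    """Drop subtrees that cannot affect the agent's decision."""
    if not node.get("visible", True):
        return None
    kept = (prune_invisible(c) for c in node.get("children", []))
    return {**node, "children": [c for c in kept if c]}

def collapse_wrappers(node: dict) -> dict:
    """Merge purely structural single-child containers into their child."""
    while len(node.get("children", [])) == 1 and not node.get("text") and not node.get("clickable"):
        node = node["children"][0]
    return {**node, "children": [collapse_wrappers(c) for c in node.get("children", [])]}

def serialize(node: dict, depth: int = 0) -> str:
    """Emit a compact, indentation-based text form instead of verbose XML."""
    keep = {k: node[k] for k in ("id", "text", "clickable") if node.get(k)}
    line = "  " * depth + node.get("class", "node") + " " + str(keep)
    return "\n".join([line] + [serialize(c, depth + 1) for c in node.get("children", [])])

def transform(ui_tree: dict) -> str:
    """A composed transformation program: prune, collapse, serialize."""
    pruned = prune_invisible(ui_tree)
    return serialize(collapse_wrappers(pruned)) if pruned else ""
```

In this framing, the synthesis problem is choosing and ordering such operators so that the token count shrinks without dropping actionable elements, which is exactly the efficiency-completeness trade-off the paper targets.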
Related papers
- AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent [21.148033135113927]
We introduce an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation. We propose an adaptive feature renormalization technique (a token-level affine transformation) that effectively enriches low-resolution image embeddings. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
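The abstract describes the technique only as a token-level affine transformation over image embeddings; a PyTorch-style sketch of that idea, with hypothetical module and argument names, might look like:

```python
import torch
import torch.nn as nn

class AdaptiveFeatureRenorm(nn.Module):
    """Sketch of a token-level affine transformation: each low-resolution
    image token is rescaled and shifted by parameters predicted from a
    guidance feature. Names and shapes are hypothetical, not AFRAgent's code."""

    def __init__(self, dim: int):
        super().__init__()
        # Predict a per-token scale (gamma) and shift (beta) from the guidance.
        self.to_gamma_beta = nn.Linear(dim, 2 * dim)

    def forward(self, low_res_tokens: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        # low_res_tokens, guidance: (batch, num_tokens, dim)
        gamma, beta = self.to_gamma_beta(guidance).chunk(2, dim=-1)
        return gamma * low_res_tokens + beta
```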
arXiv Detail & Related papers (2025-11-30T11:32:54Z) - UI-UG: A Unified MLLM for UI Understanding and Generation [19.7078650905834]
We introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs.
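Of the training stages named here, GRPO's group-relative scoring is the most self-contained to illustrate. The sketch below shows the standard group-relative advantage computation, not UI-UG's actual training code:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group Relative Policy Optimization advantage: each of the G responses
    sampled for the same prompt is scored against its group's statistics,
    removing the need for a separate learned value function.
    rewards: (num_prompts, group_size) scalar rewards per sampled response."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```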
arXiv Detail & Related papers (2025-09-29T06:59:09Z) - Less is More: Empowering GUI Agent with Context-Aware Simplification [62.02157661751793]
We propose a context-aware framework for building an efficient and effective GUI Agent, termed SimpAgent. With the above components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance.
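The abstract does not detail the simplification mechanism, but one plausible shape of "context-aware simplification" is compressing stale interaction history before it reaches the model. The sketch below is purely illustrative; the keys and the keep-recent heuristic are assumptions, not SimpAgent's method:

```python
def simplify_context(history: list[dict], keep_recent: int = 2) -> list[dict]:
    """Hypothetical context simplification: keep the most recent observations
    intact and replace older ones with compact summaries, shrinking the
    tokens (and FLOPs) the agent spends re-reading stale UI state."""
    recent = history[-keep_recent:]
    older = [{"summary": h.get("summary", h.get("action", ""))} for h in history[:-keep_recent]]
    return older + recent
```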
arXiv Detail & Related papers (2025-07-04T17:37:15Z) - Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [101.09478572153239]
We propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments.
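The guidance scheme can be sketched as best-of-N action selection under a process reward model; the `vlm.sample` and `reward_model.score` interfaces below are hypothetical stand-ins, not the paper's API:

```python
def guided_action(vlm, reward_model, observation, num_candidates: int = 8):
    """Inference-time guidance sketch: sample candidate actions from the VLM
    agent and keep the one the process reward model scores highest."""
    candidates = [vlm.sample(observation) for _ in range(num_candidates)]
    scores = [reward_model.score(observation, a) for a in candidates]
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```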
arXiv Detail & Related papers (2025-04-22T17:52:42Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
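A hybrid bottleneck of this kind pairs a cheap convolution for local detail with a single attention layer for global context. The PyTorch sketch below is illustrative of that design pattern, not ContextFormer's published architecture:

```python
import torch
import torch.nn as nn

class HybridBottleneck(nn.Module):
    """Sketch of a CNN+attention bottleneck: a depthwise convolution keeps
    local detail cheap while one self-attention layer adds global context.
    Layer sizes are illustrative assumptions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, H, W) bottleneck feature map
        x = x + self.local(x)                                   # local branch
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))        # (b, h*w, c)
        tokens = tokens + self.attn(tokens, tokens, tokens)[0]  # global branch
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```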
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity [7.299239909796724]
UI-JEPA is a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data.
We introduce two new multimodal datasets, "Intent in the Wild" (IIW) and "Intent in the Tame" (IIT).
IIW consists of 1.7K videos across 219 intent categories, while IIT contains 914 videos across 10 categories.
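JEPA-style training predicts the embeddings of masked content rather than its pixels. A minimal sketch of that objective, with hypothetical module interfaces (not UI-JEPA's released code), follows:

```python
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor, frames, mask):
    """JEPA-style objective sketch: encode visible UI content, predict the
    embeddings of masked regions, and regress onto targets from a frozen
    (e.g. EMA-updated) target encoder. All module interfaces are hypothetical.
    mask: boolean tensor of shape (batch, tokens) marking masked positions."""
    with torch.no_grad():
        targets = target_encoder(frames)          # (batch, tokens, dim)
    context = context_encoder(frames, mask=mask)  # sees only unmasked tokens
    predicted = predictor(context, mask=mask)     # predicts masked tokens
    return F.mse_loss(predicted[mask], targets[mask])
```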
arXiv Detail & Related papers (2024-09-06T07:44:44Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
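Early-exit inference is straightforward to sketch: run backbone stages in order and stop at the first exit head that is confident enough. The mean-confidence rule below is a common heuristic, not necessarily the learned exit policy this paper co-optimises:

```python
def early_exit_segment(backbone_stages, exit_heads, image, threshold: float = 0.9):
    """Anytime inference sketch for a multi-exit network: `backbone_stages`
    and `exit_heads` are matched lists of modules. Easy images exit at a
    shallow head and skip the deeper (costlier) stages entirely."""
    feat = image
    for stage, head in zip(backbone_stages, exit_heads):
        feat = stage(feat)
        logits = head(feat)                                     # (batch, classes, H, W)
        confidence = logits.softmax(dim=1).max(dim=1).values.mean()
        if confidence >= threshold:
            return logits                                       # exit early
    return logits                                               # deepest exit
```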
arXiv Detail & Related papers (2021-06-07T11:37:03Z)