BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
- URL: http://arxiv.org/abs/2509.15566v4
- Date: Mon, 27 Oct 2025 11:34:58 GMT
- Title: BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
- Authors: Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan,
- Abstract summary: "Blink-Think-Link" is a brain-inspired framework for human-GUI interaction.<n>The system decomposes interactions into three biologically plausible phases.<n>Blink Data Generation and BTL Reward enable reinforcement learning driven by both process and outcome.
- Score: 19.79016351559358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward -- the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates competitive performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI Agents.
Related papers
- EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration [16.593979443102754]
We introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory.<n>First, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model.<n>Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable ''memories''<n>Third, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process.
arXiv Detail & Related papers (2025-12-22T13:42:18Z) - AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent [21.148033135113927]
We introduce an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation.<n>We propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings.<n>We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
arXiv Detail & Related papers (2025-11-30T11:32:54Z) - History-Aware Reasoning for GUI Agents [15.519853892615272]
Current methods integrate Reinforcement Learning with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement.<n>We propose a History-Aware Reasoning framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge.<n>We develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware.
arXiv Detail & Related papers (2025-11-12T09:06:25Z) - Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents [15.303188467166752]
We present CogniGUI, a cognitive framework developed to overcome limitations by enabling adaptive learning for GUI automation resembling human-like behavior.<n>To assess the generalization and adaptability of agent systems, we introduce ScreenSeek, a comprehensive benchmark that includes multi application navigation, dynamic state transitions, and cross interface coherence.<n> Experimental results demonstrate that CogniGUI surpasses state-of-the-art methods in both the current GUI grounding benchmarks and our newly proposed benchmark.
arXiv Detail & Related papers (2025-06-22T06:30:52Z) - Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution.<n>We develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test.<n>Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z) - A Survey on (M)LLM-Based GUI Agents [62.57899977018417]
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction.<n>Recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms.<n>This survey identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control.
arXiv Detail & Related papers (2025-03-27T17:58:31Z) - Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems [57.30711059396246]
Current Graphical User Interface (GUI) grounding systems locate interface elements based on natural language instructions.<n>Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis.
arXiv Detail & Related papers (2025-03-09T06:14:17Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.<n>In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively)
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - GUI Agents: A Survey [157.9623286951606]
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction.<n>Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
arXiv Detail & Related papers (2024-12-18T04:48:28Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents.<n>It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue.<n>It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP)
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.