See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
- URL: http://arxiv.org/abs/2509.13615v1
- Date: Wed, 17 Sep 2025 01:14:14 GMT
- Title: See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
- Authors: Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
- Abstract summary: Multimodal agents' inability to reliably execute toggle control instructions remains a key bottleneck. We propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%.
- Score: 26.687510922403405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of multimodal agents facilitates effective interaction within graphical user interfaces (GUIs), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations in a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
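The failure mode highlighted above is agents tapping a toggle even when its state already matches the instruction. Below is a minimal Python sketch of the see-think-act decision that StaR trains agents to make; the `ToggleObservation` schema, the keyword-based intent parser, and the action dictionary format are illustrative assumptions, not the authors' implementation (StaR learns this reasoning through training rather than hard-coding it).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ToggleObservation:
    """What the agent perceives for a single toggle on the screen (hypothetical schema)."""
    name: str                          # e.g. "Wi-Fi"
    is_on: bool                        # current state read from the screenshot / UI tree
    bounds: Tuple[int, int, int, int]  # (left, top, right, bottom) pixels, used for tapping

def desired_state_from_instruction(instruction: str, toggle_name: str) -> bool:
    """Toy intent parser: decide whether the instruction asks for ON or OFF.
    A StaR-trained agent infers this from the instruction itself; keywords are a stand-in."""
    text = instruction.lower()
    if "turn off" in text or "disable" in text:
        return False
    if "turn on" in text or "enable" in text:
        return True
    raise ValueError(f"Cannot infer desired state for {toggle_name!r} from {instruction!r}")

def state_aware_act(instruction: str, obs: ToggleObservation) -> dict:
    """See-think-act for one binary toggle:
    see the current state, think about the desired state, act only if they differ."""
    desired = desired_state_from_instruction(instruction, obs.name)
    if obs.is_on == desired:
        # The failure case the benchmark isolates: the state already matches, so do not tap.
        return {"action": "no_op", "reason": f"{obs.name} is already {'on' if desired else 'off'}"}
    center_x = (obs.bounds[0] + obs.bounds[2]) // 2
    center_y = (obs.bounds[1] + obs.bounds[3]) // 2
    return {"action": "tap", "coordinate": (center_x, center_y)}

if __name__ == "__main__":
    wifi = ToggleObservation(name="Wi-Fi", is_on=True, bounds=(820, 210, 900, 250))
    print(state_aware_act("Turn on Wi-Fi", wifi))   # no_op: already on
    print(state_aware_act("Turn off Wi-Fi", wifi))  # tap at the toggle's center
```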
Related papers
- Computer-Using World Model [58.59112582915026]
We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next user interface (UI) state. CUWM first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution.
arXiv Detail & Related papers (2026-02-19T13:48:29Z)
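The test-time action search described above amounts to a simple selection loop: simulate each candidate with the world model, score the predicted outcome against the goal, and only then execute. The Python sketch below is illustrative only; the `world_model` and `scorer` callables and the action dictionary shape are placeholder assumptions, not the CUWM interfaces.

```python
from typing import Any, Callable, List, Tuple

def test_time_action_search(
    candidates: List[dict],
    screenshot: Any,
    goal: str,
    world_model: Callable[[Any, dict], Tuple[str, Any]],
    scorer: Callable[[str, str], float],
) -> dict:
    """Pick the candidate whose simulated outcome best matches the goal.
    The world model predicts (change description, next screenshot) for each
    candidate action; nothing is executed until the best one is chosen."""
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        change_text, _next_screenshot = world_model(screenshot, action)
        score = scorer(change_text, goal)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

if __name__ == "__main__":
    # Toy stand-ins; in CUWM both the world model and the agent are learned models.
    def toy_world_model(screenshot, action):
        if action["target"] == "settings icon":
            return ("The settings panel opens with Wi-Fi and Bluetooth toggles visible.", screenshot)
        return ("The app returns to the previous home screen.", screenshot)

    def toy_scorer(change_text, goal):
        return sum(word in change_text.lower() for word in goal.lower().split())

    candidates = [{"type": "click", "target": "settings icon"},
                  {"type": "click", "target": "back button"}]
    print(test_time_action_search(candidates, None, "open the settings panel", toy_world_model, toy_scorer))
```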
- ANCHOR: Branch-Point Data Generation for GUI Agents [52.22377425487]
End-to-end GUI agents for real desktop environments require large amounts of high-quality interaction data. We present a trajectory expansion framework, Anchor, that bootstraps scalable desktop supervision from a small set of verified seed demonstrations. Experiments on standard desktop benchmarks, OSWorld and WindowsAgentArena, show that models fine-tuned on our expanded corpus achieve consistent improvements.
arXiv Detail & Related papers (2026-02-06T19:55:26Z)
- Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents [52.30603055218294]
Trajectory2Task is a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios. It converts valid tool-call trajectories into user-facing tasks with controlled intent adaptations. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures.
arXiv Detail & Related papers (2026-01-28T00:36:13Z)
- GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents [39.807839972627015]
We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. We introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples.
arXiv Detail & Related papers (2026-01-14T14:27:28Z)
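A schematic Python sketch of the coarse-to-fine grounding idea mentioned above: a coarse stage proposes a screen region, a fine stage localizes the target inside it. The `coarse_model`/`fine_model` interfaces and the quadrant/center stand-ins are illustrative assumptions, not GUI-Eyes' actual models.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def progressive_ground(
    image_size: Tuple[int, int],
    query: str,
    coarse_model: Callable[[Tuple[int, int], str], Box],
    fine_model: Callable[[Box, str], Tuple[int, int]],
) -> Tuple[int, int]:
    """Coarse exploration proposes a region of interest, fine-grained grounding
    localizes the target inside it, and the result is mapped back to full-image
    coordinates."""
    region = coarse_model(image_size, query)         # stage 1: which part of the screen to look at
    local_x, local_y = fine_model(region, query)     # stage 2: crop-relative target location
    return region[0] + local_x, region[1] + local_y  # back to full-screenshot coordinates

if __name__ == "__main__":
    # Stand-in models: the coarse stage always returns the top-right quadrant,
    # the fine stage always returns the center of the given region.
    coarse = lambda size, query: (size[0] // 2, 0, size[0], size[1] // 2)
    fine = lambda box, query: ((box[2] - box[0]) // 2, (box[3] - box[1]) // 2)
    print(progressive_ground((1920, 1080), "the notification bell icon", coarse, fine))  # (1440, 270)
```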
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent [58.07447442040785]
We introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation. Results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales.
arXiv Detail & Related papers (2026-01-12T17:55:51Z)
- SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios [7.983317067810301]
SWITCH (Semantic World Interface Tasks for Control and Handling) is an embodied, task-driven benchmark created through iterative releases to probe these gaps. It evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions.
arXiv Detail & Related papers (2025-11-20T09:52:20Z)
- Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on τ-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
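One common way to obtain such activation-space directions is a difference of class means, sketched below with toy data; this is only an illustration of the general idea and not necessarily the estimator TraitBasis uses.

```python
import numpy as np

def learn_trait_direction(acts_with_trait: np.ndarray, acts_without_trait: np.ndarray) -> np.ndarray:
    """Difference-of-means estimate of a trait direction; inputs are (num_examples, hidden_dim)."""
    direction = acts_with_trait.mean(axis=0) - acts_without_trait.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden_state: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Push a hidden state along the trait direction so the simulated user shows
    more (positive strength) or less (negative strength) of the trait."""
    return hidden_state + strength * direction

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden_dim = 64
    impatient = rng.normal(0.0, 1.0, (200, hidden_dim)) + 0.5  # toy "impatient-user" activations
    patient = rng.normal(0.0, 1.0, (200, hidden_dim)) - 0.5    # toy "patient-user" activations
    d = learn_trait_direction(impatient, patient)
    h = rng.normal(0.0, 1.0, hidden_dim)
    print(float(h @ d), float(steer(h, d, 2.0) @ d))  # the projection onto d grows after steering
```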
- Instruction Agent: Enhancing Agent with Expert Demonstration [12.67489098612846]
Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult tasks. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which avoids making mistakes during execution.
arXiv Detail & Related papers (2025-09-08T18:00:12Z)
- FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents [12.315613848863784]
We introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations. FineState-Bench includes 2257 task benchmarks in four components and uses a four-phase indicator for perception-to-control assessment. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability.
arXiv Detail & Related papers (2025-08-12T15:12:42Z)
- GTA1: GUI Test-time Scaling Agent [77.60727242084971]
This paper addresses two main challenges of GUI agents with GTA1, our GUI Test-time Scaling Agent. First, to select the most appropriate action proposal, we introduce a test-time scaling method. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements.
arXiv Detail & Related papers (2025-07-08T08:52:18Z)
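A minimal illustration of selecting among sampled action proposals at test time, in the spirit of the first contribution above; the toy planner and the shortest-proposal judge are stand-ins for whatever scoring the paper's method uses.

```python
import random
from typing import Callable, List

def select_action_proposal(
    propose: Callable[[], str],
    judge: Callable[[List[str]], int],
    num_samples: int = 8,
) -> str:
    """Test-time scaling by repeated sampling: draw several action proposals from
    the same planner and let a judge return the index of the best one; no proposal
    is executed or simulated before selection."""
    proposals = [propose() for _ in range(num_samples)]
    return proposals[judge(proposals)]

if __name__ == "__main__":
    random.seed(0)
    # Stand-ins: a stochastic planner and a judge that prefers the shortest proposal.
    def toy_planner() -> str:
        return random.choice([
            "click the 'Save' button",
            "open the File menu, hover over the submenu, then maybe click 'Save As' and confirm",
            "click 'Save', then reopen the document to double-check it saved",
        ])
    def toy_judge(proposals: List[str]) -> int:
        return min(range(len(proposals)), key=lambda i: len(proposals[i]))
    print(select_action_proposal(toy_planner, toy_judge))
```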
- XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration [73.87038197602268]
This study introduces a new perspective on evaluation methods for Device-Control Agents (DC agents). We propose the XBOUND evaluation method, which employs the calculation of a novel Explore Metric to delineate the capability boundaries of DC agents. We evaluate the OS-Atlas and UI-TARS series, examining both the overall and specific performance across five common tasks.
arXiv Detail & Related papers (2025-05-27T14:49:30Z)
- Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents [33.899782380901314]
VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts. Existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards.
arXiv Detail & Related papers (2025-05-17T07:58:34Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent [24.97846085313314]
We propose a formalized and comprehensive environment to evaluate the entire process of automated GUI testing. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection. It evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data.
arXiv Detail & Related papers (2024-12-24T13:41:47Z)
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
- Scalable Perception-Action-Communication Loops with Convolutional and Graph Neural Networks [208.15591625749272]
We present a perception-action-communication loop design using Vision-based Graph Aggregation and Inference (VGAI).
Our framework is implemented by a cascade of a convolutional and a graph neural network (CNN / GNN), addressing agent-level visual perception and feature learning.
We demonstrate that VGAI yields performance comparable to or better than other decentralized controllers.
arXiv Detail & Related papers (2021-06-24T23:57:21Z)
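For the VGAI entry above, a toy numpy sketch of the CNN-then-GNN cascade idea: each agent encodes its local view, features are averaged over a communication graph, and a linear readout produces per-agent actions. The dimensions, the hand-rolled convolution, and the single aggregation round are simplifications for illustration, not VGAI's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_encode(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Stand-in for the per-agent CNN: one valid 2-D convolution, then pooling
    (mean and std) into a 2-dimensional feature vector."""
    h, w = image.shape
    kh, kw = kernel.shape
    response = np.array([[(image[i:i + kh, j:j + kw] * kernel).sum()
                          for j in range(w - kw + 1)]
                         for i in range(h - kh + 1)])
    return np.array([response.mean(), response.std()])

def gnn_aggregate(features: np.ndarray, adjacency: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One round of graph aggregation: each agent averages its neighbours' features
    over the communication graph, then applies a shared linear layer and ReLU."""
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbour_mean = adjacency @ features / degree
    return np.maximum(neighbour_mean @ weight, 0.0)

if __name__ == "__main__":
    num_agents = 4
    images = rng.normal(size=(num_agents, 8, 8))        # each agent's local visual observation
    kernel = rng.normal(size=(3, 3))
    adjacency = np.array([[0, 1, 0, 0],
                          [1, 0, 1, 0],
                          [0, 1, 0, 1],
                          [0, 0, 1, 0]], dtype=float)   # line-shaped communication graph
    features = np.stack([cnn_encode(images[i], kernel) for i in range(num_agents)])
    w_gnn = rng.normal(size=(2, 2))
    w_readout = rng.normal(size=(2, 2))                 # maps aggregated features to a 2-D control action
    actions = gnn_aggregate(features, adjacency, w_gnn) @ w_readout
    print(actions.shape)                                # (4, 2): one action vector per agent
```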
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.