Related papers: OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents

OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents

URL: http://arxiv.org/abs/2503.16465v3
Date: Mon, 14 Jul 2025 15:56:44 GMT
Title: OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents
Authors: Pengzhou Cheng, Zheng Wu, Zongru Wu, Aston Zhang, Zhuosheng Zhang, Gongshen Liu,
Abstract summary: We introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step.<n>We show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios.
Score: 37.92783542037974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks in a fully autonomous way, without adequate assessment of its action confidence to compromise an adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address the issue, we introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step and efficiently deciding whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing that annotates confidence scores at each interaction step; (ii) confidence-driven interaction that leverages these confidence scores to elicit the ability of adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with 24.59\%$\sim$87.29\% improvements in task success rate. OS-Kairos facilitates an adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and codes are available at https://github.com/Wuzheng02/OS-Kairos.

Related papers

Modeling Distinct Human Interaction in Web Agents [59.600507469754575]
We introduce the task of modeling human intervention to support collaborative web task execution.<n>We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover.<n>We deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness.
arXiv Detail & Related papers (2026-02-19T18:11:28Z)
ColorAgent: Building A Robust, Personalized, and Interactive OS Agent [48.95201741635228]
Building operating system (OS) agents capable of executing user instructions and faithfully following user desires is becoming a reality.<n>We present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment.<n>We explore personalized user intent recognition and proactive engagement, positioning the OS agent as a warm, collaborative partner.
arXiv Detail & Related papers (2025-10-22T09:02:48Z)
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents [39.3943822850841]
We introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm.<n>We show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art.
arXiv Detail & Related papers (2025-09-09T09:46:01Z)
CoAct-1: Computer-using Agents with Coding as Actions [94.99657662893338]
CoAct-1 is a novel multi-agent system that combines GUI-based control with direct programmatic execution.<n>We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%.
arXiv Detail & Related papers (2025-08-05T21:33:36Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
STAMP: Scalable Task And Model-agnostic Collaborative Perception [24.890993164334766]
STAMP is a task- and model-agnostic, collaborative perception pipeline for heterogeneous agents.<n>It minimizes computational overhead, enhances scalability, and preserves model security.<n>As a first-of-its-kind framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy.
arXiv Detail & Related papers (2025-01-24T16:27:28Z)
Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions [25.464036307823974]
SocialEgoNet is a graph-based framework that exploits task dependencies through a hierarchical learning approach.<n>SocialEgoNet uses body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed.<n>For evaluation, we augment an existing egocentric human-agent interaction with new class labels and bounding box annotations.
arXiv Detail & Related papers (2024-12-21T16:54:28Z)
AutoGLM: Autonomous Foundation Agents for GUIs [51.276965515952]
We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs) We have developed AutoGLM as a practical foundation agent system for real-world GUI interactions. Our evaluations demonstrate AutoGLM's effectiveness across multiple domains.
arXiv Detail & Related papers (2024-10-28T17:05:10Z)
Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions [12.074590482085831]
We seek to codify factors governing safe multi-agent interactions via the lens of responsibility. We propose a data-driven modeling approach based on control barrier functions and differentiable optimization.
arXiv Detail & Related papers (2024-10-09T20:20:41Z)
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP) With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface. We propose a chain-of-action technique to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30$K$ unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z)
Realistic simulation of users for IT systems in cyber ranges [63.20765930558542]
We instrument each machine by means of an external agent to generate user activity. This agent combines both deterministic and deep learning based methods to adapt to different environment. We also propose conditional text generation models to facilitate the creation of conversations and documents.
arXiv Detail & Related papers (2021-11-23T10:53:29Z)
Optimizing Interactive Systems via Data-Driven Objectives [70.3578528542663]
We propose an approach that infers the objective directly from observed user interactions. These inferences can be made regardless of prior knowledge and across different types of user behavior. We introduce Interactive System (ISO), a novel algorithm that uses these inferred objectives for optimization.
arXiv Detail & Related papers (2020-06-19T20:49:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.