UFO: A UI-Focused Agent for Windows OS Interaction
- URL: http://arxiv.org/abs/2402.07939v5
- Date: Thu, 23 May 2024 05:18:58 GMT
- Title: UFO: A UI-Focused Agent for Windows OS Interaction
- Authors: Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- Abstract summary: We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS.
UFO employs a dual-agent framework to meticulously observe and analyze the graphical user interface (GUI) and control information of Windows applications.
We conducted testing of UFO across 9 popular Windows applications, encompassing a variety of scenarios reflective of users' daily usage.
- Score: 40.9389397337166
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to meticulously observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to seamlessly navigate and operate within individual applications and across them to fulfill user requests, even when spanning multiple applications. The framework incorporates a control interaction module, facilitating action grounding without human intervention and enabling fully automated execution. Consequently, UFO transforms arduous and time-consuming processes into simple tasks achievable solely through natural language commands. We conducted testing of UFO across 9 popular Windows applications, encompassing a variety of scenarios reflective of users' daily usage. The results, derived from both quantitative metrics and real-case studies, underscore the superior effectiveness of UFO in fulfilling user requests. To the best of our knowledge, UFO stands as the first UI agent specifically tailored for task completion within the Windows OS environment. The open-source code for UFO is available on https://github.com/microsoft/UFO.
Related papers
- UFO2: The Desktop AgentOS [60.317812905300336]
UFO2 is a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation.
We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs.
Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
arXiv Detail & Related papers (2025-04-20T13:04:43Z)
- UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction [16.731754927372585]
We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents.
Unlike online benchmarks, UI-Vision provides dense, high-quality annotations of human demonstrations.
Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B.
arXiv Detail & Related papers (2025-03-19T19:26:17Z)
- Large Action Models: From Inception to Implementation [51.81485642442344]
Large Action Models (LAMs) are designed for action generation and execution within dynamic environments.
LAMs hold the potential to transform AI from passive language understanding to active task completion.
We present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment.
arXiv Detail & Related papers (2024-12-13T11:19:56Z)
- Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents [40.86728610906313]
AXIS is a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over user interface actions.
Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans.
It also explores the possibility of turning every application into an agent, paving the way towards an agent-centric operating system (Agent OS).
arXiv Detail & Related papers (2024-09-25T17:58:08Z)
- AppAgent v2: Advanced Agent for Flexible Mobile Interactions [46.789563920416626]
This work introduces a novel LLM-based multimodal agent framework for mobile devices.
Our agent constructs a flexible action space that enhances adaptability across various applications.
Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
- Human-Centered LLM-Agent User Interface: A Position Paper [8.675534401018407]
Large Language Model (LLM)-in-the-loop applications have been shown to effectively interpret the human user's commands.
A user mostly ignorant of the underlying tools/systems should be able to work with an LLM-Agent User Interface (LAUI) to discover an emergent workflow.
arXiv Detail & Related papers (2024-05-19T13:02:45Z)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that enables LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with pass@1 rates of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z)
- UFO: Unidentified Foreground Object Detection in 3D Point Cloud [7.286344230797102]
Existing 3D object detectors face difficult challenges in both 3D localization and Out-of-Distribution detection.
We suggest a new UFO detection framework comprising three components: evaluation protocol, methodology, and benchmark.
The proposed framework consistently enhances performance by a large margin across all four baseline detectors.
arXiv Detail & Related papers (2024-01-08T12:16:06Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- UFO: Unified Feature Optimization [67.77936811483664]
This paper proposes a novel Unified Feature Optimization (UFO) paradigm for training and deploying deep models.
UFO aims to benefit each single task with a large-scale pretraining on all tasks.
UFO provides great convenience for flexible deployment, while maintaining the benefits of large-scale pretraining.
arXiv Detail & Related papers (2022-07-21T07:34:06Z)
- First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization [112.40598205054994]
We formalize this idea as a completely unsupervised objective for optimizing interfaces.
We conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games.
The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains.
arXiv Detail & Related papers (2022-05-24T21:57:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.