ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
- URL: http://arxiv.org/abs/2410.11872v2
- Date: Thu, 17 Oct 2024 07:12:31 GMT
- Title: ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
- Authors: Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoshchuk, Artur Janicki
- Abstract summary: ClickAgent is a novel framework for building autonomous agents.
In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model identifies the relevant UI elements on the screen.
Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
- Abstract: With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
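The division of labor described in the abstract can be pictured with a short sketch: the MLLM only decides what to do next in natural language, while a separate UI location model turns the chosen element description into screen coordinates. This is a minimal sketch under stated assumptions; the Action structure and the plan_next_action/locate_element helpers are hypothetical placeholders, not the authors' actual API.

```python
# Minimal sketch of a ClickAgent-style control loop (illustrative only).
# The function names (plan_next_action, locate_element) and the prompt
# format are hypothetical placeholders, not the authors' actual API.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "done"
    target: str = ""   # natural-language description of the UI element
    text: str = ""     # text to type, if any

def plan_next_action(mllm, task: str, screenshot: bytes, history: list[str]) -> Action:
    """The MLLM handles reasoning and planning: it describes WHAT to do next,
    but is never asked for pixel coordinates."""
    prompt = f"Task: {task}\nHistory: {history}\nDescribe the next UI action."
    return mllm.plan(prompt, screenshot)  # assumed wrapper around an MLLM call

def locate_element(ui_locator, screenshot: bytes, description: str) -> tuple[int, int]:
    """A dedicated UI location model (e.g. a SeeClick-style grounding model)
    maps the element description to (x, y) screen coordinates."""
    return ui_locator.ground(screenshot, description)

def run_episode(mllm, ui_locator, device, task: str, max_steps: int = 20) -> bool:
    history: list[str] = []
    for _ in range(max_steps):
        screen = device.screenshot()
        action = plan_next_action(mllm, task, screen, history)
        if action.kind == "done":
            return True
        if action.kind == "tap":
            x, y = locate_element(ui_locator, screen, action.target)
            device.tap(x, y)
        elif action.kind == "type":
            device.type_text(action.text)
        history.append(f"{action.kind}: {action.target or action.text}")
    return False
```

The point of the split is that the planner is never asked to predict pixel coordinates, which is exactly the limitation of current-generation MLLMs that ClickAgent works around.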
Related papers
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
SPA-Bench offers three key contributions, including a diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines.
It also includes a novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z) - Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI).
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z) - TinyClick: Single-Turn Agent for Empowering GUI Automation [0.18846515534317265]
We present a single-turn agent for graphical user interface (GUI) interaction tasks, using Vision-Language Model Florence-2-Base.
The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command.
It demonstrates strong performance on Screenspot and OmniAct, while maintaining a compact size of 0.27B parameters and minimal latency.
arXiv Detail & Related papers (2024-10-09T12:06:43Z) - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z) - MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents [7.4568642040547894]
Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone graphical user interfaces (GUIs).
Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents.
We propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing.
arXiv Detail & Related papers (2024-06-12T13:14:50Z) - CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images.
By leveraging the reasoning capability of Large Language Models, we eliminate the need for large-scale human demonstration data.
The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z) - Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
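A minimal sketch of how such a planning/decision/reflection split could be wired together is given below; the Memory class, method names, and control flow are assumptions for illustration, not Mobile-Agent-v2's actual implementation.

```python
# Minimal sketch of a three-agent loop in the spirit of Mobile-Agent-v2
# (planning / decision / reflection). The Memory class and method names are
# illustrative assumptions, not the authors' implementation.

class Memory:
    """Carries task progress between the three agents."""
    def __init__(self):
        self.progress: list[str] = []

def run_task(planner, decider, reflector, device, task: str, max_steps: int = 25):
    memory = Memory()
    for _ in range(max_steps):
        screen = device.screenshot()
        # Planning agent: condenses history into the current task progress.
        plan = planner.update_progress(task, memory.progress, screen)
        # Decision agent: picks the concrete next operation on the device.
        operation = decider.decide(task, plan, screen)
        if operation.name == "stop":
            break
        device.execute(operation)
        # Reflection agent: checks the outcome and flags wrong operations.
        verdict = reflector.check(screen, device.screenshot(), operation)
        memory.progress.append(f"{operation.name} -> {verdict}")
```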
arXiv Detail & Related papers (2024-06-03T05:50:00Z) - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that enables LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both, with pass@1 rates of 12.5% and 87.7%, respectively.
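To make the idea of an agent-computer interface concrete, here is a minimal sketch of a file-viewing, file-editing, and test-running command set; the class and method names are illustrative assumptions and do not reproduce SWE-agent's actual ACI.

```python
# Minimal sketch of an agent-computer interface (ACI) in the spirit of
# SWE-agent: a small command set for viewing/editing files and running tests.
# The command names and signatures are illustrative, not SWE-agent's interface.

import subprocess
from pathlib import Path

class MiniACI:
    def __init__(self, repo_root: str):
        self.root = Path(repo_root)

    def open_file(self, rel_path: str, start: int = 1, window: int = 50) -> str:
        """Show a numbered window of a file instead of dumping it whole."""
        lines = (self.root / rel_path).read_text().splitlines()
        chunk = lines[start - 1:start - 1 + window]
        return "\n".join(f"{start + i}: {line}" for i, line in enumerate(chunk))

    def edit(self, rel_path: str, start: int, end: int, new_text: str) -> None:
        """Replace lines start..end (1-indexed, inclusive) with new_text."""
        path = self.root / rel_path
        lines = path.read_text().splitlines()
        lines[start - 1:end] = new_text.splitlines()
        path.write_text("\n".join(lines) + "\n")

    def run_tests(self, cmd: str = "pytest -q") -> str:
        """Execute the test suite and return its combined output."""
        proc = subprocess.run(cmd.split(), cwd=self.root,
                              capture_output=True, text=True)
        return proc.stdout + proc.stderr
```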
arXiv Detail & Related papers (2024-05-06T17:41:33Z) - CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches: comprehensive environment perception (CEP) and conditional action prediction (CAP).
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z) - CogAgent: A Visual Language Model for GUI Agents [61.26491779502794]
We introduce CogAgent, a visual language model (VLM) specializing in GUI understanding and navigation.
By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120.
CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE.
arXiv Detail & Related papers (2023-12-14T13:20:57Z) - Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing [17.24045904273874]
We propose DroidAgent, an autonomous GUI testing agent for Android.
It is based on Large Language Models and supporting mechanisms such as long- and short-term memory.
DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques.
arXiv Detail & Related papers (2023-11-15T01:59:40Z)
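As a rough illustration of how an LLM-driven GUI testing loop with long- and short-term memory might look, a minimal sketch follows; the memory split, method names, and coverage bookkeeping are assumptions, not DroidAgent's implementation.

```python
# Minimal sketch of an LLM-driven GUI testing loop with long- and short-term
# memory, in the spirit of DroidAgent. Method names and the memory split are
# illustrative assumptions, not the authors' implementation.

def explore_app(llm, device, app: str, budget: int = 100) -> set[str]:
    short_term: list[str] = []      # recent actions/observations for the current task
    long_term: list[str] = []       # persisted summaries of completed tasks
    visited_activities: set[str] = set()

    for _ in range(budget):
        state = device.gui_state()  # current GUI hierarchy / screenshot
        visited_activities.add(device.current_activity())
        # The LLM proposes the next testing action from the app state plus memory.
        action = llm.next_test_action(app, state, short_term, long_term)
        if action.kind == "finish_task":
            # Compress the short-term trace into long-term memory, start fresh.
            long_term.append(llm.summarize(short_term))
            short_term.clear()
            continue
        device.execute(action)
        short_term.append(f"{action.kind}({action.args})")

    return visited_activities  # coverage proxy, cf. the activity coverage above
```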