Understanding the Weakness of Large Language Model Agents within a
Complex Android Environment
- URL: http://arxiv.org/abs/2402.06596v1
- Date: Fri, 9 Feb 2024 18:19:25 GMT
- Title: Understanding the Weakness of Large Language Model Agents within a
Complex Android Environment
- Authors: Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, Zhen Xiao
- Abstract summary: Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games.
LLMs face three primary challenges when applied to general-purpose software systems like operating systems.
These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system.
- Score: 21.278266207772756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have empowered intelligent agents to execute
intricate tasks within domain-specific software such as browsers and games.
However, when applied to general-purpose software systems like operating
systems, LLM agents face three primary challenges. Firstly, the action space is
vast and dynamic, making it difficult for LLM agents to maintain an up-to-date
understanding and deliver accurate responses. Secondly, real-world tasks often
require inter-application cooperation, demanding farsighted planning from LLM
agents. Thirdly, agents need to identify optimal solutions aligning with user
constraints, such as security concerns and preferences. These challenges
motivate AndroidArena, an environment and benchmark designed to evaluate LLM
agents on a modern operating system. To address the high cost of manpower, we
design a scalable and semi-automated method to construct the benchmark. In the
task evaluation, AndroidArena incorporates accurate and adaptive metrics to
address the issue of non-unique solutions. Our findings reveal that even
state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to
specific constraints. Additionally, we identify a lack of four key
capabilities, i.e., understanding, reasoning, exploration, and reflection, as
primary reasons for the failure of LLM agents. Furthermore, we provide an
empirical analysis of reflection failures and improve the success rate
by 27% with our proposed exploration strategy. This work is the first to
present valuable insights into the fine-grained weaknesses of LLM agents,
and offers a path forward for future research in this area. Environment,
benchmark, and evaluation code for AndroidArena are released at
https://github.com/AndroidArenaAgent/AndroidArena.
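As a rough illustration of the interaction loop such a benchmark evaluates, the hedged Python sketch below pairs a toy Android-like environment with a placeholder LLM policy. All names here (AndroidEnv, query_llm, the screen graph, the success condition) are hypothetical and do not reflect the released AndroidArena API.

```python
"""Hypothetical sketch of an LLM-agent loop over an Android-like environment.

Everything here is an illustrative placeholder, not the AndroidArena code.
"""
from dataclasses import dataclass, field


@dataclass
class AndroidEnv:
    """Toy stand-in for an Android environment: a few screens and actions."""
    screens: dict = field(default_factory=lambda: {
        "home": ["open_settings", "open_messages"],
        "settings": ["toggle_wifi", "back"],
        "messages": ["compose", "back"],
    })
    current: str = "home"

    def observe(self) -> dict:
        # Return the current screen and the actions it currently exposes.
        return {"screen": self.current, "actions": self.screens[self.current]}

    def step(self, action: str) -> bool:
        # Apply an action; navigation actions change the screen.
        transitions = {"open_settings": "settings",
                       "open_messages": "messages",
                       "back": "home"}
        if action in transitions:
            self.current = transitions[action]
        return action == "toggle_wifi"  # toy success condition for the demo task


def query_llm(task: str, observation: dict, history: list) -> str:
    """Placeholder for an LLM call: pick one of the currently valid actions.

    A real agent would prompt a model with the task, the UI observation,
    and its own history; a trivial heuristic keeps this sketch runnable.
    """
    for action in observation["actions"]:
        if action not in history:
            return action
    return observation["actions"][0]


def run_episode(task: str, max_steps: int = 10) -> bool:
    env, history = AndroidEnv(), []
    for _ in range(max_steps):
        obs = env.observe()              # dynamic, screen-dependent action space
        action = query_llm(task, obs, history)
        history.append(action)           # memory usable for reflection/exploration
        if env.step(action):
            return True                  # task-specific success signal
    return False


if __name__ == "__main__":
    print("success:", run_episode("Turn on Wi-Fi"))
```

Swapping query_llm for a real model call and AndroidEnv for a device controller yields the usual observe-decide-act evaluation loop; the actual environment and metrics are in the repository linked above.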
Related papers
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state of the art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points, respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Agentless: Demystifying LLM-based Software Engineering Agents [12.19683999553113]
We build Agentless -- an agentless approach to automatically solve software development problems.
Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation.
Our results on the popular SWE-bench Lite benchmark show that, surprisingly, the simplistic Agentless is able to achieve both the highest performance and a low cost.
arXiv Detail & Related papers (2024-07-01T17:24:45Z)
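To make the three phases above concrete, here is a hedged skeleton of a localize/repair/validate pipeline. The function names, signatures, and toy heuristics are illustrative assumptions, not the actual Agentless implementation.

```python
"""Hedged skeleton of a localize / repair / validate pipeline (illustrative only)."""
from typing import Dict, List


def localize(issue: str, repo_files: Dict[str, str]) -> List[str]:
    """Phase 1: rank files likely to contain the fault (toy keyword match)."""
    return [path for path, text in repo_files.items()
            if any(word in text for word in issue.lower().split())]


def repair(issue: str, file_path: str, source: str) -> str:
    """Phase 2: produce a candidate patch; a real system would prompt an LLM here."""
    return f"--- {file_path}\n+++ {file_path}\n# candidate fix for: {issue}"


def validate(patch: str, run_tests) -> bool:
    """Phase 3: keep a patch only if the project's tests pass with it applied."""
    return run_tests(patch)


def pipeline(issue: str, repo_files: Dict[str, str], run_tests) -> List[str]:
    candidates = [repair(issue, p, repo_files[p]) for p in localize(issue, repo_files)]
    return [patch for patch in candidates if validate(patch, run_tests)]


if __name__ == "__main__":
    repo = {"calc.py": "def divide(a, b): return a / b  # crashes on zero"}
    print(pipeline("divide by zero crash", repo, run_tests=lambda p: True))
```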
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs).
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z)
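A minimal sketch of the select-then-apply idea described above, assuming a two-stage prompting loop with a stubbed model call; the method pool and prompt wording are invented for illustration and are not the paper's templates.

```python
"""Two-stage select-then-apply prompting sketch (illustrative only)."""
from typing import Callable

# A small pool of candidate reasoning methods and their prompt prefixes.
REASONING_METHODS = {
    "chain_of_thought": "Solve the task step by step, showing your reasoning.",
    "decomposition": "Break the task into sub-problems, solve each, then combine.",
    "direct": "Answer the task directly and concisely.",
}


def call_model(prompt: str) -> str:
    """Stub LLM call so the sketch runs offline; swap in a real client."""
    return "chain_of_thought" if "Choose" in prompt else f"[answer to: {prompt[:50]}...]"


def meta_reasoning_prompt(task: str, llm: Callable[[str], str] = call_model) -> str:
    # Stage 1: ask the model which reasoning method fits this task.
    choice = llm(
        f"Task: {task}\nChoose the best reasoning method from "
        f"{list(REASONING_METHODS)} and reply with its name only."
    ).strip()
    method = REASONING_METHODS.get(choice, REASONING_METHODS["direct"])

    # Stage 2: apply the chosen method's instructions to the task.
    return llm(f"{method}\nTask: {task}")


if __name__ == "__main__":
    print(meta_reasoning_prompt("How many prime numbers are below 30?"))
```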
- GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications [46.85306320942487]
Large Language Models (LLMs) are evolving to actively engage with tools and perform actions on real-world applications and services.
Today, humans verify the correctness and appropriateness of the LLM-generated outputs before putting them into real-world execution.
This poses significant challenges, as code comprehension is notoriously difficult.
In this paper, we study how humans can efficiently collaborate with, delegate to, and supervise autonomous LLMs in the future.
arXiv Detail & Related papers (2024-04-10T11:17:33Z)
- Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
- Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach [31.6589518077397]
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets.
LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions.
We propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query LLMs for high-level instructions.
arXiv Detail & Related papers (2023-06-06T11:49:09Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" manner and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
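As a rough illustration of the debate loop described in the entry above, the hedged sketch below runs two debaters and a judge against a stubbed model call; the roles, prompt wording, and round structure are assumptions for illustration, not the authors' implementation.

```python
"""Minimal multi-agent debate loop sketch (illustrative only)."""
from typing import Callable, List


def call_model(prompt: str) -> str:
    """Stub for an LLM call so the sketch runs without external services."""
    return f"[model response to: {prompt[:60]}...]"


def debate(question: str, rounds: int = 3,
           llm: Callable[[str], str] = call_model) -> str:
    transcript: List[str] = []

    for r in range(rounds):
        # Affirmative side states and defends its current answer.
        aff = llm(f"Question: {question}\nDebate so far: {transcript}\n"
                  "You are the affirmative debater. State and defend your answer.")
        transcript.append(f"Affirmative (round {r + 1}): {aff}")

        # Negative side rebuts in a tit-for-tat exchange.
        neg = llm(f"Question: {question}\nDebate so far: {transcript}\n"
                  "You are the negative debater. Rebut the previous argument.")
        transcript.append(f"Negative (round {r + 1}): {neg}")

    # A judge reads the whole transcript and produces the final solution.
    return llm(f"Question: {question}\nDebate transcript: {transcript}\n"
               "You are the judge. Summarize the debate and give the final answer.")


if __name__ == "__main__":
    print(debate("Is 3307 a prime number?"))
```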