Understanding the Weakness of Large Language Model Agents within a
Complex Android Environment
- URL: http://arxiv.org/abs/2402.06596v1
- Date: Fri, 9 Feb 2024 18:19:25 GMT
- Title: Understanding the Weakness of Large Language Model Agents within a
Complex Android Environment
- Authors: Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, Zhen Xiao
- Abstract summary: Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games.
LLM agents face three primary challenges when applied to general-purpose software systems such as operating systems.
These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system.
- Score: 21.278266207772756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have empowered intelligent agents to execute
intricate tasks within domain-specific software such as browsers and games.
However, when applied to general-purpose software systems like operating
systems, LLM agents face three primary challenges. Firstly, the action space is
vast and dynamic, posing difficulties for LLM agents to maintain an up-to-date
understanding and deliver accurate responses. Secondly, real-world tasks often
require inter-application cooperation, demanding farsighted planning from LLM
agents. Thirdly, agents need to identify optimal solutions aligning with user
constraints, such as security concerns and preferences. These challenges
motivate AndroidArena, an environment and benchmark designed to evaluate LLM
agents on a modern operating system. To address the high cost of manual effort, we
design a scalable and semi-automated method to construct the benchmark. In the
task evaluation, AndroidArena incorporates accurate and adaptive metrics to
address the issue of non-unique solutions. Our findings reveal that even
state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to
specific constraints. Additionally, we identify a lack of four key
capabilities, i.e., understanding, reasoning, exploration, and reflection, as
primary reasons for the failure of LLM agents. Furthermore, we provide
empirical analysis on the failure of reflection, and improve the success rate
by 27% with our proposed exploration strategy. This work is the first to
present valuable insights into the fine-grained weaknesses of LLM agents,
and offers a path forward for future research in this area. Environment,
benchmark, and evaluation code for AndroidArena are released at
https://github.com/AndroidArenaAgent/AndroidArena.
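To make the action-space challenge concrete, the sketch below shows a minimal observe-act loop of the kind such an Android benchmark implies: dump the current UI hierarchy, enumerate the clickable nodes (a set that is both large and different at every step), and let a language model pick the next tap. This is an illustrative assumption, not the released AndroidArena API; the adb/uiautomator commands are standard Android tooling, while query_llm is a hypothetical stub for whatever model call an agent plugs in.

# Minimal, hypothetical observe-act loop for an Android LLM agent.
# Function names and the query_llm stub are assumptions, not the AndroidArena API.
import subprocess
import xml.etree.ElementTree as ET

def dump_ui_tree(device: str = "emulator-5554") -> ET.Element:
    """Dump the current UI hierarchy via adb/uiautomator and parse it."""
    subprocess.run(
        ["adb", "-s", device, "shell", "uiautomator", "dump", "/sdcard/ui.xml"],
        check=True,
    )
    xml_text = subprocess.run(
        ["adb", "-s", device, "shell", "cat", "/sdcard/ui.xml"],
        check=True, capture_output=True, text=True,
    ).stdout
    return ET.fromstring(xml_text)

def actionable_elements(root: ET.Element) -> list[dict]:
    """Enumerate clickable nodes; this set is large and changes every step,
    which is the 'vast and dynamic action space' the paper highlights."""
    elements = []
    for node in root.iter("node"):
        if node.get("clickable") == "true":
            elements.append({
                "text": node.get("text", ""),
                "resource-id": node.get("resource-id", ""),
                "bounds": node.get("bounds", ""),
            })
    return elements

def query_llm(task: str, elements: list[dict]) -> int:
    """Stub: ask an LLM to choose the index of the element to tap next.
    A real agent would plug its own model call in here."""
    raise NotImplementedError

def step(task: str, device: str = "emulator-5554") -> None:
    """One observe-act step: read the screen, ask the model, tap."""
    root = dump_ui_tree(device)
    elements = actionable_elements(root)
    choice = query_llm(task, elements)
    # Tap the centre of the chosen element's bounds, e.g. "[0,0][1080,200]".
    left_top, right_bottom = elements[choice]["bounds"][1:-1].split("][")
    x1, y1 = map(int, left_top.split(","))
    x2, y2 = map(int, right_bottom.split(","))
    subprocess.run(
        ["adb", "-s", device, "shell", "input", "tap",
         str((x1 + x2) // 2), str((y1 + y2) // 2)],
        check=True,
    )

Even this toy loop makes the paper's point visible: the agent must re-enumerate the candidate actions after every tap, and cross-APP tasks require it to plan across many such screens rather than react to one.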
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Agentless: Demystifying LLM-based Software Engineering Agents [12.19683999553113]
We build Agentless -- an agentless approach to automatically solve software development problems.
Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation.
Our results on the popular SWE-bench Lite benchmark show that, surprisingly, the simplistic Agentless achieves both the highest performance and low cost.
arXiv Detail & Related papers (2024-07-01T17:24:45Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z)
- AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
- Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach [31.6589518077397]
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets.
LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions.
We propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query LLMs for high-level instructions.
arXiv Detail & Related papers (2023-06-06T11:49:09Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" fashion and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.