Agent Workflow Memory
- URL: http://arxiv.org/abs/2409.07429v1
- Date: Wed, 11 Sep 2024 17:21:00 GMT
- Title: Agent Workflow Memory
- Authors: Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
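- Abstract summary: We introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing them to the agent.
AWM improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, respectively.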
Online AWM robustly generalizes in cross-task, website, and domain evaluations.
- Score: 71.81385627556398
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
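As a rough illustration of the online setting described in the abstract, the loop below keeps a memory of past action trajectories, induces workflows from them, and passes the current workflows to the agent on each new query. This is a minimal sketch and not the authors' implementation: the names `induce_workflows`, `WorkflowMemory`, and `run_online` are hypothetical, induction is replaced by simple counting of recurring action sub-sequences, and storing only trajectories judged successful is an assumption of the sketch.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Action = str
Trajectory = Tuple[Action, ...]


def induce_workflows(
    trajectories: Sequence[Trajectory], length: int = 2, min_count: int = 2
) -> List[Trajectory]:
    """Return contiguous action sub-sequences shared by several trajectories.

    Stand-in for workflow induction (assumption): each sub-sequence is counted
    once per trajectory and kept if it occurs in at least `min_count` of them.
    """
    counts: Counter = Counter()
    for traj in trajectories:
        seen = {traj[i:i + length] for i in range(len(traj) - length + 1)}
        counts.update(seen)
    return sorted(w for w, c in counts.items() if c >= min_count)


class WorkflowMemory:
    """Hypothetical memory: stores trajectories and the workflows induced from them."""

    def __init__(self) -> None:
        self.trajectories: List[Trajectory] = []
        self.workflows: List[Trajectory] = []

    def add(self, trajectory: Trajectory) -> None:
        # Re-induce workflows every time a new trajectory is stored.
        self.trajectories.append(tuple(trajectory))
        self.workflows = induce_workflows(self.trajectories)


def run_online(
    queries: Sequence[str],
    agent: Callable[[str, List[Trajectory]], Tuple[Trajectory, bool]],
    memory: WorkflowMemory,
) -> List[bool]:
    """Process test queries in order, growing the workflow memory on the fly."""
    outcomes = []
    for query in queries:
        # The agent sees the workflows induced so far as extra guidance.
        trajectory, success = agent(query, memory.workflows)
        if success:  # assumption: only successful trajectories are stored
            memory.add(trajectory)
        outcomes.append(success)
    return outcomes
```

In the offline setting the same memory would instead be filled once from training examples before any test query is processed.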
Related papers
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
(2024-10-17T17:50:38Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
(2024-10-10T12:41:19Z)
- GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks.
We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models.
The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
(2024-09-02T17:44:10Z)
- PLANRL: A Motion Planning and Imitation Learning Framework to Bootstrap Reinforcement Learning [13.564676246832544]
We introduce PLANRL, a framework that chooses when the robot should use classical motion planning and when it should learn a policy.
PLANRL switches between two modes of operation: reaching a waypoint using classical techniques when away from the objects and fine-grained manipulation control when about to interact with objects.
We evaluate our approach across multiple challenging simulation environments and real-world tasks, demonstrating superior performance in terms of adaptability, efficiency, and generalization compared to existing methods.
(2024-08-07T19:30:08Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
(2024-07-15T17:54:37Z)
- Couler: Unified Machine Learning Workflow Optimization in Cloud [6.769259207650922]
Couler is a system designed for unified ML workflow optimization in the cloud.
We integrate Large Language Models (LLMs) into workflow generation, and provide a unified programming interface for various workflow engines.
Couler has improved CPU/memory utilization by more than 15% and the workflow completion rate by around 17%.
arXiv Detail & Related papers (2024-03-12T12:47:32Z) - WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
(2023-07-25T22:59:32Z)