Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
- URL: http://arxiv.org/abs/2306.07863v3
- Date: Fri, 19 Jan 2024 06:59:26 GMT
- Title: Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
- Authors: Longtao Zheng, Rundong Wang, Xinrun Wang, Bo An
- Abstract summary: Building agents with large language models for computer control is a burgeoning research area, where the agent receives computer states and performs actions to complete tasks.
Previous computer agents have demonstrated the benefits of in-context learning (ICL), but their performance is hindered by several issues.
We introduce Synapse, a computer agent featuring three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions to improve multi-step decision-making, and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks.
- Score: 23.115574119132507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building agents with large language models (LLMs) for computer control is a
burgeoning research area, where the agent receives computer states and performs
actions to complete complex tasks. Previous computer agents have demonstrated
the benefits of in-context learning (ICL); however, their performance is
hindered by several issues. First, the limited context length of LLMs and
complex computer states restrict the number of exemplars, as a single webpage
can consume the entire context. Second, the exemplars in current methods, such
as high-level plans and multi-choice questions, cannot represent complete
trajectories, leading to suboptimal performance in long-horizon tasks. Third,
existing computer agents rely on task-specific exemplars and overlook the
similarity among tasks, resulting in poor generalization to novel tasks. To
address these challenges, we introduce Synapse, a computer agent featuring
three key components: i) state abstraction, which filters out task-irrelevant
information from raw states, allowing more exemplars within the limited
context, ii) trajectory-as-exemplar prompting, which prompts the LLM with
complete trajectories of the abstracted states and actions to improve
multi-step decision-making, and iii) exemplar memory, which stores the
embeddings of exemplars and retrieves them via similarity search for
generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard
task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse
achieves a 99.2% average success rate (a 10% relative improvement) across 64
tasks using demonstrations from only 48 tasks. Notably, Synapse is the first
ICL method to solve the book-flight task in MiniWoB++. Synapse also exhibits a
56% relative improvement in average step success rate over the previous
state-of-the-art prompting scheme in Mind2Web.
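The abstract outlines three components: state abstraction, trajectory-as-exemplar prompting, and an exemplar memory queried by embedding similarity. The sketch below illustrates how such a pipeline could fit together; every name (embed, ExemplarMemory, abstract_state, build_prompt) and the hash-based placeholder embedding are assumptions made for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of a Synapse-style pipeline (assumed names, not the paper's code).
from dataclasses import dataclass
import hashlib
import math


def embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: hash-based pseudo-vector.
    A real system would call a text-embedding model here."""
    vec = []
    for i in range(dim):
        h = hashlib.sha256(f"{i}:{text}".encode()).digest()
        vec.append(int.from_bytes(h[:4], "big") / 2**32 - 0.5)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Trajectory:
    task: str                      # natural-language task description
    steps: list[tuple[str, str]]   # (abstracted state, action) pairs


class ExemplarMemory:
    """Stores task embeddings and retrieves similar trajectories for a novel task."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], Trajectory]] = []

    def add(self, traj: Trajectory) -> None:
        self._items.append((embed(traj.task), traj))

    def retrieve(self, query_task: str, k: int = 2) -> list[Trajectory]:
        q = embed(query_task)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [traj for _, traj in ranked[:k]]


def abstract_state(raw_state: str) -> str:
    """Placeholder state abstraction: keep only lines mentioning interactive elements.
    The real component filters task-irrelevant content from raw observations."""
    keep = ("button", "input", "link", "text")
    return "\n".join(line for line in raw_state.splitlines()
                     if any(word in line.lower() for word in keep))


def build_prompt(exemplars: list[Trajectory], task: str, current_state: str) -> str:
    """Trajectory-as-exemplar prompt: full (state, action) sequences, then the new task."""
    parts = []
    for ex in exemplars:
        parts.append(f"Task: {ex.task}")
        for state, action in ex.steps:
            parts.append(f"Observation:\n{state}\nAction: {action}")
    parts.append(f"Task: {task}")
    parts.append(f"Observation:\n{abstract_state(current_state)}\nAction:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    memory = ExemplarMemory()
    memory.add(Trajectory(
        task="Book a flight from NYC to LA",
        steps=[("<input id=from> <input id=to> <button id=search>", "type(from, 'NYC')")],
    ))
    exemplars = memory.retrieve("Book a flight from Boston to Seattle", k=1)
    print(build_prompt(exemplars, "Book a flight from Boston to Seattle",
                       "<header>Logo</header>\n<input id=from> <input id=to> <button id=search>"))
```

In practice the embedding function would be a real text-embedding model and state abstraction would operate on structured page observations (e.g., HTML or accessibility trees); the sketch only shows the retrieve-then-prompt control flow described in the abstract.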
Related papers
- PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent.
From the perception perspective, we devise an Active Perception Module (APM) to overcome the limited ability of current MLLMs to perceive screenshot content.
From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI).
Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
- Spatial Reasoning and Planning for Deep Embodied Agents [2.7195102129095003]
This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks.
It focuses on enhancing learning efficiency, interpretability, and transferability across novel scenarios.
arXiv Detail & Related papers (2024-09-28T23:05:56Z)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images.
By leveraging the reasoning capability of the Large Language Models, we eliminate the need for large-scale human demonstration data.
The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the order of all the multi-task data for training.
At the task level, we aim to find the optimal task order that minimizes the total cross-task interference risk.
At the instance level, we measure the difficulty of all instances per task, then divide them into easy-to-difficult mini-batches for training.
arXiv Detail & Related papers (2024-01-07T18:12:20Z)
- A Zero-Shot Language Agent for Computer Control with Structured Reflection [19.526676887048662]
Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment.
To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting.
We approach this problem with a zero-shot agent that requires no given expert traces.
arXiv Detail & Related papers (2023-10-12T21:53:37Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is a recent multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and an over 80% reduction in computation, but poses challenges for efficient deployment on FPGAs.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Language Models can Solve Computer Tasks [13.914130729517584]
We show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme.
We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++.
arXiv Detail & Related papers (2023-03-30T16:01:52Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem in which a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Environment Generation for Zero-Shot Compositional Reinforcement Learning [105.35258025210862]
Compositional Design of Environments (CoDE) trains a Generator agent to automatically build a series of compositional tasks tailored to the agent's current skill level.
We learn to generate environments composed of multiple pages or rooms, and train RL agents capable of completing wide-range of complex tasks in those environments.
CoDE yields a 4x higher success rate than the strongest baseline and demonstrates strong performance on real websites when trained on 3500 primitive tasks.
arXiv Detail & Related papers (2022-01-21T21:35:01Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)