Learning to Solve Complex Tasks by Talking to Agents
- URL: http://arxiv.org/abs/2110.08542v1
- Date: Sat, 16 Oct 2021 10:37:34 GMT
- Title: Learning to Solve Complex Tasks by Talking to Agents
- Authors: Tushar Khot, Kyle Richardson, Daniel Khashabi, Ashish Sabharwal
- Abstract summary: Humans often solve complex problems by interacting with existing agents, such as AI assistants, that can solve simpler sub-tasks.
Common NLP benchmarks aim for the development of self-sufficient models for every task.
We propose a new benchmark called CommaQA that contains three kinds of complex reasoning tasks designed to be solved by "talking" to four agents with different capabilities.
- Score: 39.08818632689814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans often solve complex problems by interacting (in natural language) with
existing agents, such as AI assistants, that can solve simpler sub-tasks. These
agents themselves can be powerful systems built using extensive resources and
privately held data. In contrast, common NLP benchmarks aim for the development
of self-sufficient models for every task. To address this gap and facilitate
research towards "green" AI systems that build upon existing agents, we
propose a new benchmark called CommaQA that contains three kinds of complex
reasoning tasks that are designed to be solved by "talking" to four agents
with different capabilities. We demonstrate that state-of-the-art black-box
models, which are unable to leverage existing agents, struggle on CommaQA
(exact match score only reaches 40 points) even when given access to the agents'
internal knowledge and gold fact supervision. On the other hand, models using
gold question decomposition supervision can indeed solve CommaQA to a high
accuracy (over 96% exact match) by learning to utilize the agents. Even these
additional supervision models, however, do not solve our compositional
generalization test set. Finally, the end goal of learning to solve complex
tasks by communicating with existing agents without relying on any
additional supervision remains unsolved, and we hope CommaQA serves as a novel
benchmark to enable the development of such systems.
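The idea of solving a complex question by "talking" to agents can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy agents, their invented facts, the `#prev` placeholder convention, and the order-insensitive reading of exact match are hypothetical and are not taken from CommaQA or its released code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy agents: each answers only a narrow set of simple questions.
# The facts below are invented for illustration.
Agent = Callable[[str], List[str]]


def text_agent(question: str) -> List[str]:
    facts = {"Who directed Movie A?": ["Director X"]}
    return facts.get(question, [])


def table_agent(question: str) -> List[str]:
    facts = {"Which awards did Director X win?": ["Award P", "Award Q"]}
    return facts.get(question, [])


AGENTS: Dict[str, Agent] = {"text": text_agent, "table": table_agent}


def solve(decomposition: List[Tuple[str, str]]) -> List[str]:
    """Run a chain of (agent_name, question_template) steps.

    The placeholder '#prev' in a template is filled with each answer
    returned by the previous step (an assumed convention).
    """
    answers = [""]
    for agent_name, template in decomposition:
        next_answers: List[str] = []
        for prev in answers:
            question = template.replace("#prev", prev)
            next_answers.extend(AGENTS[agent_name](question))
        answers = next_answers
    return answers


def exact_match(predicted: List[str], gold: List[str]) -> bool:
    # Assumed reading of exact match: same answers, order ignored.
    return sorted(predicted) == sorted(gold)


# A gold-style decomposition of "Which awards did the director of Movie A win?"
steps = [
    ("text", "Who directed Movie A?"),
    ("table", "Which awards did #prev win?"),
]
prediction = solve(steps)
print(prediction, exact_match(prediction, ["Award Q", "Award P"]))
```

In this sketch the decomposition is supplied by hand, which loosely corresponds to the gold decomposition supervision mentioned in the abstract; learning to produce such steps without that supervision is the part the abstract describes as unsolved.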
Related papers
- Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
- A Survey on Complex Tasks for Goal-Directed Interactive Agents [60.53915548970061]
This survey compiles relevant tasks and environments for evaluating goal-directed interactive agents.
An up-to-date compilation of relevant resources can be found on our project website.
arXiv Detail & Related papers (2024-09-27T08:17:53Z)
- LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents [0.0]
We propose a novel LLM-based Agent Unified Modeling Framework (LLM-Agent-UMF).
Our framework distinguishes between the different components of an LLM-based agent, setting LLMs and tools apart from a new element, the core-agent.
We evaluate our framework by applying it to thirteen state-of-the-art agents, thereby demonstrating its alignment with their functionalities.
arXiv Detail & Related papers (2024-09-17T17:54:17Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning [12.80689911863731]
Sibyl is a powerful framework designed to tackle complex reasoning tasks by efficiently leveraging a minimal set of tools.
Sibyl implements a multi-agent debate-based jury to self-refine the final answers, ensuring a comprehensive and balanced approach.
Our experimental results on the GAIA benchmark test set reveal that the Sibyl agent achieves state-of-the-art performance with an average score of 34.55%.
arXiv Detail & Related papers (2024-07-15T13:45:40Z)
- Adaptive In-conversation Team Building for Language Model Agents [33.03550687362213]
Leveraging multiple large language model (LLM) agents has been shown to be a promising approach for tackling complex tasks.
Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent.
A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods.
arXiv Detail & Related papers (2024-05-29T18:08:37Z)
- Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning [14.635361844362794]
Smurfs is a cutting-edge multi-agent framework designed to revolutionize the application of large language models.
Smurfs can enhance the model's ability to solve complex tasks at no additional cost.
arXiv Detail & Related papers (2024-05-09T17:49:04Z)
- Multi-Agent Consensus Seeking via Large Language Models [6.922356864800498]
Multi-agent systems driven by large language models (LLMs) have shown promising abilities for solving complex tasks in a collaborative manner.
This work considers a fundamental problem in multi-agent collaboration: consensus seeking.
arXiv Detail & Related papers (2023-10-31T03:37:11Z)
- Towards Collaborative Question Answering: A Preliminary Study [63.91687114660126]
We propose CollabQA, a novel QA task in which several expert agents coordinated by a moderator work together to answer questions that cannot be answered with any single agent alone.
We make a synthetic dataset of a large knowledge graph that can be distributed to experts.
We show that the problem can be challenging without introducing a prior on the collaboration structure, unless experts are perfect and uniform.
arXiv Detail & Related papers (2022-01-24T14:27:00Z)
- UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning [53.73686229912562]
We propose a novel MARL approach called Universal Value Exploration (UneVEn).
UneVEn learns a set of related tasks simultaneously with a linear decomposition of universal successor features.
Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
arXiv Detail & Related papers (2020-10-06T19:08:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.