Related papers: Learning to Solve Complex Tasks by Talking to Agents

Learning to Solve Complex Tasks by Talking to Agents

URL: http://arxiv.org/abs/2110.08542v1
Date: Sat, 16 Oct 2021 10:37:34 GMT
Title: Learning to Solve Complex Tasks by Talking to Agents
Authors: Tushar Khot and Kyle Richardson and Daniel Khashabi and Ashish Sabharwal
Abstract summary: Humans often solve complex problems by interacting with existing agents, such as AI assistants, that can solve simpler sub-tasks. Common NLP benchmarks aim for the development of self-sufficient models for every task. We propose a new benchmark called CommaQA that contains three kinds of complex reasoning tasks designed to be solved by talking'' to four agents with different capabilities.
Score: 39.08818632689814
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans often solve complex problems by interacting (in natural language) with existing agents, such as AI assistants, that can solve simpler sub-tasks. These agents themselves can be powerful systems built using extensive resources and privately held data. In contrast, common NLP benchmarks aim for the development of self-sufficient models for every task. To address this gap and facilitate research towards ``green'' AI systems that build upon existing agents, we propose a new benchmark called CommaQA that contains three kinds of complex reasoning tasks that are designed to be solved by ``talking'' to four agents with different capabilities. We demonstrate that state-of-the-art black-box models, which are unable to leverage existing agents, struggle on CommaQA (exact match score only reaches 40pts) even when given access to the agents' internal knowledge and gold fact supervision. On the other hand, models using gold question decomposition supervision can indeed solve CommaQA to a high accuracy (over 96\% exact match) by learning to utilize the agents. Even these additional supervision models, however, do not solve our compositional generalization test set. Finally the end-goal of learning to solve complex tasks by communicating with existing agents \emph{without relying on any additional supervision} remains unsolved and we hope CommaQA serves as a novel benchmark to enable the development of such systems.

Related papers

From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario [3.5262044630932254]
Multi-agent robotic systems (MARS) build upon multi-agent systems by integrating physical and task-related constraints.<n>Despite the availability of advanced multi-agent frameworks, their real-world deployment on robots remains limited.
arXiv Detail & Related papers (2025-08-06T17:54:10Z)
A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions [51.96890647837277]
Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence.
arXiv Detail & Related papers (2025-04-07T21:01:25Z)
D-CIPHER: Dynamic Collaborative Intelligent Agents with Planning and Heterogeneous Execution for Enhanced Reasoning in Offensive Security [22.86304661035188]
Large Language Models (LLMs) have been used in cybersecurity in many ways. Capture the Flag (CTF) challenges serve as benchmarks for assessing the automated task-planning abilities of LLM agents. We introduce the D-CIPHER multi-agent LLM framework for collaborative CTF challenge solving.
arXiv Detail & Related papers (2025-02-15T23:43:18Z)
Preventing Rogue Agents Improves Multi-Agent Collaboration [21.955058255432974]
We propose to monitor agents during action prediction and intervene when a future error is likely to occur.<n>Experiments on WhoDunitEnv, code generation tasks and the GovSim environment for resource sustainability show that our approach leads to substantial performance gains.
arXiv Detail & Related papers (2025-02-09T18:35:08Z)
Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
Agent-Oriented Planning in Multi-Agent Systems [54.429028104022066]
We propose AOP, a novel framework for agent-oriented planning in multi-agent systems. In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy. Extensive experiments demonstrate the advancement of AOP in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.
arXiv Detail & Related papers (2024-10-03T04:07:51Z)
A Survey on Complex Tasks for Goal-Directed Interactive Agents [60.53915548970061]
This survey compiles relevant tasks and environments for evaluating goal-directed interactive agents. An up-to-date compilation of relevant resources can be found on our project website.
arXiv Detail & Related papers (2024-09-27T08:17:53Z)
LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents [0.0]
We propose a novel LLM-based Agent Unified Modeling Framework (LLM-Agent-UMF) Our framework distinguishes between the different components of an LLM-based agent, setting LLMs and tools apart from a new element, the core-agent. We evaluate our framework by applying it to thirteen state-of-the-art agents, thereby demonstrating its alignment with their functionalities.
arXiv Detail & Related papers (2024-09-17T17:54:17Z)
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning [12.80689911863731]
Sibyl is a powerful framework designed to tackle complex reasoning tasks by efficiently leveraging a minimal set of tools. Sibyl implements a multi-agent debate-based jury to self-refine the final answers, ensuring a comprehensive and balanced approach. Our experimental results on the GAIA benchmark test set reveal that the Sibyl agent achieves state-of-the-art performance with an average score of 34.55%.
arXiv Detail & Related papers (2024-07-15T13:45:40Z)
Adaptive In-conversation Team Building for Language Model Agents [33.03550687362213]
Leveraging multiple large language model (LLM) agents has shown to be a promising approach for tackling complex tasks. Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent. A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods.
arXiv Detail & Related papers (2024-05-29T18:08:37Z)
Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning [14.635361844362794]
Smurfs' is a cutting-edge multi-agent framework designed to revolutionize the application of large language models. Smurfs can enhance the model's ability to solve complex tasks at no additional cost.
arXiv Detail & Related papers (2024-05-09T17:49:04Z)
Multi-Agent Consensus Seeking via Large Language Models [6.922356864800498]
Multi-agent systems driven by large language models (LLMs) have shown promising abilities for solving complex tasks in a collaborative manner. This work considers a fundamental problem in multi-agent collaboration: consensus seeking.
arXiv Detail & Related papers (2023-10-31T03:37:11Z)
Towards Collaborative Question Answering: A Preliminary Study [63.91687114660126]
We propose CollabQA, a novel QA task in which several expert agents coordinated by a moderator work together to answer questions that cannot be answered with any single agent alone. We make a synthetic dataset of a large knowledge graph that can be distributed to experts. We show that the problem can be challenging without introducing prior to the collaboration structure, unless experts are perfect and uniform.
arXiv Detail & Related papers (2022-01-24T14:27:00Z)
UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning [53.73686229912562]
We propose a novel MARL approach called Universal Value Exploration (UneVEn) UneVEn learns a set of related tasks simultaneously with a linear decomposition of universal successor features. Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
arXiv Detail & Related papers (2020-10-06T19:08:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.