DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
- URL: http://arxiv.org/abs/2303.17071v1
- Date: Thu, 30 Mar 2023 00:30:19 GMT
- Title: DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
- Authors: Varun Nair, Elliot Schumacher, Geoffrey Tso, Anitha Kannan
- Abstract summary: Large language models (LLMs) have emerged as valuable tools for many natural language understanding tasks.
In this work, we present dialog-enabled resolving agents (DERA).
DERA is a paradigm made possible by the increased conversational abilities of LLMs, namely GPT-4.
It provides a simple, interpretable forum for models to communicate feedback and iteratively improve output.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have emerged as valuable tools for many natural
language understanding tasks. In safety-critical applications such as
healthcare, the utility of these models is governed by their ability to
generate outputs that are factually accurate and complete. In this work, we
present dialog-enabled resolving agents (DERA). DERA is a paradigm made
possible by the increased conversational abilities of LLMs, namely GPT-4. It
provides a simple, interpretable forum for models to communicate feedback and
iteratively improve output. We frame our dialog as a discussion between two
agent types - a Researcher, who processes information and identifies crucial
problem components, and a Decider, who has the autonomy to integrate the
Researcher's information and make judgments on the final output.
We test DERA against three clinically-focused tasks. For medical conversation
summarization and care plan generation, DERA shows significant improvement over
the base GPT-4 performance in both human expert preference evaluations and
quantitative metrics. In a new finding, we also show that GPT-4's performance
(70%) on an open-ended version of the MedQA question-answering (QA) dataset
(Jin et al. 2021, USMLE) is well above the passing level (60%), with DERA
showing similar performance. We release the open-ended MedQA dataset at
https://github.com/curai/curai-research/tree/main/DERA.
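The Researcher/Decider framing amounts to a critique-and-revise loop. As a rough illustration only, here is a minimal Python sketch of such a loop using the `openai` chat-completions client; the `dera` function, its prompts, and the "NO ISSUES" stopping signal are hypothetical names for this sketch, not the paper's exact implementation.

```python
from openai import OpenAI  # assumes the `openai` Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(system: str, user: str) -> str:
    """Single chat-completion call; GPT-4 matches the model used in the paper."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

def dera(task: str, draft: str, max_rounds: int = 3) -> str:
    """Refine `draft` through a Researcher/Decider dialog (illustrative prompts)."""
    for _ in range(max_rounds):
        # Researcher: processes the task and flags crucial problem components.
        feedback = chat(
            "You are a Researcher. Review the output for factual errors and "
            "omissions. Reply NO ISSUES if the output needs no changes.",
            f"Task:\n{task}\n\nCurrent output:\n{draft}",
        )
        if "NO ISSUES" in feedback:  # illustrative stopping signal
            break
        # Decider: autonomously integrates only the feedback it judges correct.
        draft = chat(
            "You are a Decider. Revise the output, incorporating only the "
            "Researcher feedback you judge to be correct.",
            f"Task:\n{task}\n\nOutput:\n{draft}\n\nFeedback:\n{feedback}",
        )
    return draft
```

Note that, as in the paper's framing, only the Decider ever rewrites the draft; the Researcher contributes feedback but holds no authority over the final output.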
Related papers
- DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the Decomposition-Alignment-Reasoning Agent (DARA) framework.
DARA effectively parses questions into formal queries through a dual mechanism.
We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
- Large Language Model Evaluation Via Multi AI Agents: Preliminary results [3.8066447473175304]
We introduce a novel multi-agent AI model that aims to assess and compare the performance of various Large Language Models (LLMs).
Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models.
We integrate the HumanEval benchmark into our verification agent to assess the generated code's performance, providing insights into their respective capabilities and efficiencies.
arXiv Detail & Related papers (2024-04-01T10:06:04Z)
- The All-Seeing Project V2: Towards General Relation Comprehension of the Open World [58.40101895719467]
We present the All-Seeing Project V2, a new model and dataset designed for understanding object relations in images.
We propose the All-Seeing Model V2 that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation task.
Our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them.
arXiv Detail & Related papers (2024-02-29T18:59:17Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities [40.55743949223173]
Pragmatics Understanding Benchmark (PUB) is a dataset consisting of fourteen tasks in four pragmatics phenomena.
PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets.
Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models.
arXiv Detail & Related papers (2024-01-13T13:46:14Z)
- PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging [8.043625583479598]
Multimodal large language models (MLLMs) represent an evolutionary expansion in the capabilities of traditional large language models.
Recent works investigate the adaptation of MLLMs as a universal solution to address medical multi-modal problems as a generative task.
We propose a parameter efficient framework for fine-tuning MLLMs, specifically validated on medical visual question answering (Med-VQA) and medical report generation (MRG) tasks.
arXiv Detail & Related papers (2024-01-05T13:22:12Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- FireAct: Toward Language Agent Fine-tuning [63.06306936820456]
We argue for the overlooked direction of fine-tuning LMs to obtain language agents.
Fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase.
We propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods.
arXiv Detail & Related papers (2023-10-09T17:58:38Z)
- Goal Driven Discovery of Distributional Differences via Language Descriptions [58.764821647036946]
Mining large corpora can generate useful discoveries but is time-consuming for humans.
We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way.
Our system produces discoveries previously unknown to the authors on a wide range of applications in OpenD5.
arXiv Detail & Related papers (2023-02-28T01:32:32Z)
- Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation [20.18656308749408]
Large language models (LLMs) have been used for generation and can now output human-like text.
This paper investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.
arXiv Detail & Related papers (2023-01-27T22:02:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.