DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
- URL: http://arxiv.org/abs/2303.17071v1
- Date: Thu, 30 Mar 2023 00:30:19 GMT
- Title: DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
- Authors: Varun Nair, Elliot Schumacher, Geoffrey Tso, Anitha Kannan
- Abstract summary: Large language models (LLMs) have emerged as valuable tools for many natural language understanding tasks.
In this work, we present dialog-enabled resolving agents (DERA).
DERA is a paradigm made possible by the increased conversational abilities of LLMs, namely GPT-4.
It provides a simple, interpretable forum for models to communicate feedback and iteratively improve output.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have emerged as valuable tools for many natural
language understanding tasks. In safety-critical applications such as
healthcare, the utility of these models is governed by their ability to
generate outputs that are factually accurate and complete. In this work, we
present dialog-enabled resolving agents (DERA). DERA is a paradigm made
possible by the increased conversational abilities of LLMs, namely GPT-4. It
provides a simple, interpretable forum for models to communicate feedback and
iteratively improve output. We frame our dialog as a discussion between two
agent types - a Researcher, who processes information and identifies crucial
problem components, and a Decider, who has the autonomy to integrate the
Researcher's information and make judgments on the final output.
We test DERA against three clinically-focused tasks. For medical conversation
summarization and care plan generation, DERA shows significant improvement over
the base GPT-4 performance in both human expert preference evaluations and
quantitative metrics. In a new finding, we also show that GPT-4's performance
(70%) on an open-ended version of the MedQA question-answering (QA) dataset
(Jin et al. 2021, USMLE) is well above the passing level (60%), with DERA
showing similar performance. We release the open-ended MedQA dataset at
https://github.com/curai/curai-research/tree/main/DERA.
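The Researcher/Decider framing amounts to a critique-and-revise loop. As a rough illustration only, here is a minimal Python sketch of such a loop using the `openai` chat-completions client; the `dera` function, its prompts, and the "NO ISSUES" stopping signal are hypothetical names for this sketch, not the paper's exact implementation.

```python
from openai import OpenAI  # assumes the `openai` Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(system: str, user: str) -> str:
    """Single chat-completion call; GPT-4 matches the model used in the paper."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

def dera(task: str, draft: str, max_rounds: int = 3) -> str:
    """Refine `draft` through a Researcher/Decider dialog (illustrative prompts)."""
    for _ in range(max_rounds):
        # Researcher: processes the task and flags crucial problem components.
        feedback = chat(
            "You are a Researcher. Review the output for factual errors and "
            "omissions. Reply NO ISSUES if the output needs no changes.",
            f"Task:\n{task}\n\nCurrent output:\n{draft}",
        )
        if "NO ISSUES" in feedback:  # illustrative stopping signal
            break
        # Decider: autonomously integrates only the feedback it judges correct.
        draft = chat(
            "You are a Decider. Revise the output, incorporating only the "
            "Researcher feedback you judge to be correct.",
            f"Task:\n{task}\n\nOutput:\n{draft}\n\nFeedback:\n{feedback}",
        )
    return draft
```

Note that, as in the paper's framing, only the Decider ever rewrites the draft; the Researcher contributes feedback but holds no authority over the final output.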
Related papers
- DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the Decomposition-Alignment-Reasoning Agent (DARA) framework.
DARA effectively parses questions into formal queries through a dual mechanism.
We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
- Large Language Model Evaluation Via Multi AI Agents: Preliminary results [3.8066447473175304]
We introduce a novel multi-agent AI model that aims to assess and compare the performance of various Large Language Models (LLMs).
Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models.
We integrate the HumanEval benchmark into our verification agent to assess the generated code's performance, providing insights into their respective capabilities and efficiencies.
arXiv Detail & Related papers (2024-04-01T10:06:04Z)
- The All-Seeing Project V2: Towards General Relation Comprehension of the Open World [58.40101895719467]
We present the All-Seeing Project V2, a new model and dataset designed for understanding object relations in images.
We propose the All-Seeing Model V2 that integrates the formulation of text generation, object localization, and relation comprehension into a relation conversation task.
Our model excels not only in perceiving and recognizing all objects within the image but also in grasping the intricate relation graph between them.
arXiv Detail & Related papers (2024-02-29T18:59:17Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities [40.55743949223173]
Pragmatics Understanding Benchmark (PUB) is a dataset consisting of fourteen tasks in four pragmatics phenomena.
PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets.
Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models.
arXiv Detail & Related papers (2024-01-13T13:46:14Z)
- PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging [8.043625583479598]
Multimodal large language models (MLLMs) represent an evolutionary expansion in the capabilities of traditional large language models.
Recent works investigate the adaptation of MLLMs as a universal solution to address medical multi-modal problems as a generative task.
We propose a parameter efficient framework for fine-tuning MLLMs, specifically validated on medical visual question answering (Med-VQA) and medical report generation (MRG) tasks.
arXiv Detail & Related papers (2024-01-05T13:22:12Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- FireAct: Toward Language Agent Fine-tuning [63.06306936820456]
We argue for the overlooked direction of fine-tuning LMs to obtain language agents.
Fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase.
We propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods.
arXiv Detail & Related papers (2023-10-09T17:58:38Z)
- Goal Driven Discovery of Distributional Differences via Language Descriptions [58.764821647036946]
Mining large corpora can generate useful discoveries but is time-consuming for humans.
We formulate a new task, D5, that automatically discovers differences between two large corpora in a goal-driven way.
Our system produces discoveries previously unknown to the authors on a wide range of applications in OpenD5.
arXiv Detail & Related papers (2023-02-28T01:32:32Z)
- Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation [20.18656308749408]
Large language models (LLMs) have been used for generation and can now output human-like text.
This paper investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.
arXiv Detail & Related papers (2023-01-27T22:02:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.