Related papers: On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering

On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering

URL: http://arxiv.org/abs/2311.09721v1
Date: Thu, 16 Nov 2023 09:55:07 GMT
Title: On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering
Authors: Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, Arman Cohan
Abstract summary: This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models interact with a database. The task requires LLMs to strategically generate multiplesql queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction.
Score: 25.57202500348071
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task necessitates LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings highlight that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction. A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.

Related papers

New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. We introduce a new REC dataset with two key features. First, it is designed with controllable difficulty levels, requiring fine-grained reasoning across object categories, attributes, and relationships. Second, it incorporates negative text and images generated through fine-grained editing, explicitly testing a model's ability to reject non-existent targets.
arXiv Detail & Related papers (2025-02-27T13:58:44Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension [8.489816179329832]
We present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of large language models (LLMs) in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters.
arXiv Detail & Related papers (2024-11-29T06:48:13Z)
SRSA: A Cost-Efficient Strategy-Router Search Agent for Real-world Human-Machine Interactions [3.5725872564627785]
In real-world situations, users often input contextual and highly personalized queries to chatbots. Previous research has not focused specifically on authentic human-machine dialogue scenarios. To address these gaps, we propose a Strategy-based Search Agent (SRSA) SRSA routing different queries to appropriate search strategies and enabling fine-grained serial searches to obtain high-quality results at a relatively low cost.
arXiv Detail & Related papers (2024-11-21T20:41:55Z)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains. BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely [8.507599833330346]
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Retrieval-Augmented Generation (RAG) and fine-tuning are gaining increasing attention and widespread application. However, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges.
arXiv Detail & Related papers (2024-09-23T11:20:20Z)
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation [19.312330150540912]
An emerging application is using Large Language Models (LLMs) to enhance retrieval-augmented generation (RAG) capabilities. We propose FRAMES, a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval.
arXiv Detail & Related papers (2024-09-19T17:52:07Z)
Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models [9.914489049993495]
We introduce Interactive-T2S, a framework that generatessql queries through direct interactions with databases. We have developed detailed exemplars to demonstrate the step-wise reasoning processes within our framework. Our experiments on the BIRD-Dev dataset, employing a setting without oracle knowledge, reveal that our method achieves state-of-the-art results with only two exemplars.
arXiv Detail & Related papers (2024-08-09T07:43:21Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts. TACT contains challenging instructions that demand stitching information scattered across one or more texts. We construct this dataset by leveraging an existing dataset of texts and their associated tables. We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA) We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues [88.73739515457116]
We introduce four self-supervised tasks including next session prediction, utterance restoration, incoherence detection and consistency discrimination. We jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner. Experiment results indicate that the proposed auxiliary self-supervised tasks bring significant improvement for multi-turn response selection.
arXiv Detail & Related papers (2020-09-14T08:44:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.