Related papers: BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

URL: http://arxiv.org/abs/2511.04153v1
Date: Thu, 06 Nov 2025 08:00:15 GMT
Title: BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation
Authors: Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, Amin Ahsan Ali,
Abstract summary: Existing Large Language Models (LLM) struggle withsql generation from natural instructions due to large schema sizes and complex reasoning.<n>In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models.<n>Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion.
Score: 3.2476501707160543
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLM) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at https://github.com/treeDweller98/bappa-sql.

Related papers

LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting [7.590911146338215]
We propose a Single-Agent Self-Refinement with Ensemble Voting (SSEV)<n>We build on insights from the SSEV pipeline to address the growing complexity of enterprise databases and real-world Text-to-Act tasks.<n>ReCAPAgent-5.5% integrates specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation.
arXiv Detail & Related papers (2026-01-25T18:38:58Z)
From Queries to Insights: Agentic LLM Pipelines for Spatio-Temporal Text-to-SQL [8.496933324334167]
We present a naive text-to-Act baseline (Rellama-sqlcoder-8b) with orchestration by a Mistral-based Rellama-sqlcoder-8b.<n>We evaluate on 35 natural-language queries over the NYC and Tokyo check-in, covering spatial, temporal multi-dataset reasoning.<n>The agent achieves substantially higher accuracy than the dataset 91.4% vs. 28.6% and enhances usability through maps, and plots structured natural-language summaries.
arXiv Detail & Related papers (2025-10-29T22:18:57Z)
AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation [0.509780930114934]
AGENTIQL is an agent-inspired framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection.<n>We evaluate AGENTIQL on the Spider benchmark, achieving up to 86.07% EX with 14B models using the Planner&Executor merging strategy.<n>Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.
arXiv Detail & Related papers (2025-10-12T15:35:05Z)
SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation [2.0799061948689306]
SING-a is a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-data.<n>SING-LM is a family of compact language models fine-tuned on the synthetic data.
arXiv Detail & Related papers (2025-09-30T02:14:49Z)
DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction [46.422626657078666]
We present DeKeyNLU, a novel dataset which contains 1,500 meticulously annotated QA pairs.<n>We propose DeKey, a RAG-based NL2 pipeline that employs three separate modules for user question understanding, entity retrieval, and generation.
arXiv Detail & Related papers (2025-09-18T00:47:56Z)
HI-SQL: Optimizing Text-to-SQL Systems through Dynamic Hint Integration [1.3927943269211591]
Text-to-generation bridges the gap between natural language and databases, enabling users to query data without requiringsql expertise.<n>We propose HI-the, a pipeline that incorporates a novel hint generation mechanism utilizing historical query logs.<n>By analyzing prior queries, our method generates contextual hints that focus on handling the complexities of multi-table and nested operations.<n>Our approach significantly improves query accuracy of LLM-generated queries while ensuring efficiency in terms of calls and latency.
arXiv Detail & Related papers (2025-06-11T12:07:55Z)
RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component.<n>Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.<n>We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
CHESS: Contextual Harnessing for Efficient SQL Synthesis [1.9506402593665235]
We introduce CHESS, a framework for efficient and scalable text-to- queries. It comprises four specialized agents, each targeting one of the aforementioned challenges. Our framework offers features that adapt to various deployment constraints.
arXiv Detail & Related papers (2024-05-27T01:54:16Z)
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL [47.120862170230566]
Recent Text-to-yourself methods usually suffer from significant performance degradation on "huge" databases.<n>We introduce MAC, a novel Text-to-yourself LLM-based multi-agent collaborative framework.<n>In our framework, we leverage GPT-4 as the strong backbone for all agent tasks to determine the upper bound of our framework.<n>We then fine-tune an open-sourced instruction-followed model,sql-Llama, by leveraging Code 7B, to accomplish all tasks as GPT-4 does.
arXiv Detail & Related papers (2023-12-18T14:40:20Z)
DBCopilot: Natural Language Querying over Massive Databases via Schema Routing [47.009638761948466]
We present DBCopilot, a framework that addresses challenges by employing a compact and flexible copilot model for routing over massive databases.<n>This framework utilizes a single lightweight differentiable search index to construct semantic mappings for massive database schemata, and navigates natural language questions to their target databases and tables in a relation joint retrieval manner.
arXiv Detail & Related papers (2023-12-06T12:37:28Z)
Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES. Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on Poincar'e distance metric. Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences. Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.