Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling
- URL: http://arxiv.org/abs/2509.24403v3
- Date: Wed, 01 Oct 2025 02:55:56 GMT
- Title: Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling
- Authors: Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, Ying Yan
- Abstract summary: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process.
- Score: 11.577572131517714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
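The sequential and parallel perspectives named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helpers `run`, `refine`, and `tournament`, the toy `emp` table, and the stand-in `toy_fix` and `toy_judge` functions are all hypothetical, and the LLM-based generation, repair, and pairwise judging that such a framework would rely on are replaced by simple placeholders. Internal scaling (RL-enhanced reasoning) is not modelled here.

```python
import sqlite3
from itertools import combinations


def run(conn, sql):
    """Execute sql; return (rows, None) on success or (None, error message)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def refine(conn, sql, fix, max_rounds=2):
    """Sequential scaling: repair a candidate while it fails to execute."""
    for _ in range(max_rounds):
        _, err = run(conn, sql)
        if err is None:
            break
        sql = fix(sql, err)  # hypothetical repair step (an LLM call in practice)
    return sql


def tournament(conn, candidates, judge):
    """Parallel scaling: round-robin pairwise matches; most wins is selected."""
    pool = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    wins = {sql: 0 for sql in pool}
    for a, b in combinations(pool, 2):
        wins[judge(conn, a, b)] += 1
    return max(pool, key=lambda sql: wins[sql])


# Toy database and question: "Who has the highest salary?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [("a", 10), ("b", 30), ("c", 20)])

# Stand-in for diverse candidate synthesis (assumed, hand-written queries).
candidates = [
    "SELECT name FROM employees ORDER BY salary DESC LIMIT 1",  # wrong table name
    "SELECT name FROM emp ORDER BY salary DESC LIMIT 1",
    "SELECT name FROM emp ORDER BY salary LIMIT 1",  # wrong sort direction
]


def toy_fix(sql, err):
    # Assumed repair rule for this toy schema only.
    return sql.replace("employees", "emp")


refined = [refine(conn, sql, toy_fix) for sql in candidates]
results = [run(conn, sql)[0] for sql in refined]


def toy_judge(conn, a, b):
    # Stand-in judge: prefer the result that more candidates agree on.
    def support(sql):
        rows = run(conn, sql)[0]
        return sum(rows == other for other in results)
    return a if support(a) >= support(b) else b


best = tournament(conn, refined, toy_judge)
print(best)
```

The agreement-based judge is only a placeholder chosen to keep the sketch self-contained; any pairwise comparison function with the same signature could be substituted.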
Related papers
- Agentic Test-Time Scaling for WebAgents [65.5178428849495]
We present Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling.
arXiv Detail & Related papers (2026-02-12T18:58:30Z)
- APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL [39.76924093980244]
APEX-SQL is a framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data.
arXiv Detail & Related papers (2026-02-11T07:50:47Z)
- LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting [7.590911146338215]
We propose a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline. We build on insights from the SSEV pipeline to address the growing complexity of enterprise databases and real-world Text-to-SQL tasks. ReCAPAgent-SQL integrates specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation.
arXiv Detail & Related papers (2026-01-25T18:38:58Z)
- SQL-Trail: Multi-Turn Reinforcement Learning with Interleaved Feedback for Text-to-SQL [20.49395306069103]
We introduce a multi-turn reinforcement learning (RL) agentic framework for Text-to-SQL generation. Rather than producing a query in one shot, SQL-Trail interacts with the database environment and uses execution feedback to iteratively refine its predictions. Our approach centers on two key ideas: (i) an adaptive turn-budget allocation mechanism that scales the agent's interaction depth to match question difficulty, and (ii) a composite reward panel that jointly incentivizes SQL correctness and efficient exploration.
arXiv Detail & Related papers (2026-01-25T05:16:52Z)
- Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z)
- Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks [21.891522433628893]
Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
arXiv Detail & Related papers (2025-10-13T01:29:54Z)
- HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance [6.653834890554154]
We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning. This framework enables switching between reasoning and non-reasoning modes while improving query accuracy and execution efficiency.
arXiv Detail & Related papers (2025-10-10T01:15:57Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration. It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- Scaling Test-time Compute for LLM Agents [51.790752085445384]
Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents.
arXiv Detail & Related papers (2025-06-15T17:59:47Z)
- LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
- RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z)
- Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward [15.448159172903138]
Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. Existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. We propose a novel Text-to-SQL RL fine-tuning framework named Graph-Reward-SQL, which employs the GMNScore outcome reward model.
arXiv Detail & Related papers (2025-05-18T11:53:01Z)
- Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning [0.12289361708127876]
This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate data. We propose a two-stage framework that teaches a model how to traverse, filter, and aggregate table fields. Empirically, our approach achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA.
arXiv Detail & Related papers (2025-04-23T19:02:04Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment [6.2089733671434875]
We propose OpenSearch-SQL, which divides the Text-to-SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. These methods have significantly improved the performance of LLMs in the Text-to-SQL task. Experimental results show that OpenSearch-SQL achieves an execution accuracy (EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward-based efficiency score (R-VES) of 69.3, with all three metrics ranking first at the time of submission.
arXiv Detail & Related papers (2025-02-19T07:51:50Z)
- Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL [13.122218546167463]
Large language models (LLMs) have significantly improved the performance of text-to-SQL systems. Many state-of-the-art (SOTA) approaches have overlooked the critical aspect of system robustness.
arXiv Detail & Related papers (2024-12-17T04:22:22Z)
- SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting [67.97870844244187]
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. We propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet). It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size.
arXiv Detail & Related papers (2023-11-16T16:50:56Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.