RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models
- URL: http://arxiv.org/abs/2601.05451v1
- Date: Fri, 09 Jan 2026 00:46:53 GMT
- Title: RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models
- Authors: Marko Sterbentz, Kevin Cushing, Cameron Barrie, Kristian J. Hammond
- Abstract summary: RingSQL is a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. We find that models trained on RingSQL data achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data.
- Score: 1.0062127381149395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high-quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template-based approaches ensure correct SQL but require schema-specific templates, while LLM-based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data. We make our code available at https://github.com/nu-c3lab/RingSQL.
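The hybrid idea described in the abstract can be illustrated with a minimal sketch. This is not RingSQL's actual implementation (see the linked repository for that); the template format, the `SCHEMA` dictionary, and the helper names `instantiate`, `executes`, and `paraphrase` are all hypothetical, and the paraphrasing step is a stub standing in for an LLM call. The sketch only shows how a template written with slots rather than concrete table or column names could be filled from any schema, and how SQLite could be used to confirm that the resulting SQL runs.

```python
import sqlite3

# Hypothetical schema description: table -> {column: SQLite type}.
SCHEMA = {
    "employees": {"name": "TEXT", "department": "TEXT", "salary": "REAL"},
}

# Hypothetical schema-independent template: slots instead of concrete names.
TEMPLATE = {
    "sql": 'SELECT "{group_col}", AVG("{num_col}") FROM "{table}" GROUP BY "{group_col}"',
    "question": "What is the average {num_col} for each {group_col} in {table}?",
}


def instantiate(template, schema):
    """Yield (question, sql) pairs for every type-compatible slot filling."""
    for table, cols in schema.items():
        num_cols = [c for c, t in cols.items() if t in ("REAL", "INTEGER")]
        text_cols = [c for c, t in cols.items() if t == "TEXT"]
        for group_col in text_cols:
            for num_col in num_cols:
                slots = {"table": table, "group_col": group_col, "num_col": num_col}
                yield (
                    template["question"].format(**slots),
                    template["sql"].format(**slots),
                )


def executes(sql, schema):
    """Return True if the query runs against an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        for table, cols in schema.items():
            col_defs = ", ".join(f'"{c}" {t}' for c, t in cols.items())
            conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


def paraphrase(question):
    """Stub for an LLM paraphrasing call (assumption); returns the input unchanged."""
    return question


# Keep only pairs whose SQL executes; the SQL itself is never rewritten by the LLM.
pairs = [
    (paraphrase(question), sql)
    for question, sql in instantiate(TEMPLATE, SCHEMA)
    if executes(sql, SCHEMA)
]

for question, sql in pairs:
    print(question, "->", sql)
```

In this sketch the SQL comes only from the template and the schema, so the stubbed paraphrasing step can vary the wording of the question without touching the query.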
Related papers
- SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas [2.905751301655124]
A key bottleneck for developing text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe, a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. SQaLe captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity.
arXiv Detail & Related papers (2025-12-16T09:15:10Z) - Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z) - SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation [2.0799061948689306]
SING-SQL is a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data. SING-LM is a family of compact language models fine-tuned on the synthetic data.
arXiv Detail & Related papers (2025-09-30T02:14:49Z) - Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation [26.834687657847454]
Text-to-SQL models are increasingly adopted in real-world applications. Deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing Text-to-SQL models experience significant performance drops when applied to new schemas. Continuously obtaining high-quality Text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios.
arXiv Detail & Related papers (2025-02-21T22:32:35Z) - RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
Benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o.
Our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting DeepSeek (much cheaper) with the same prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z) - MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation [10.205010004198757]
Text-to-SQL generation enables non-experts to interact with databases via natural language. Recent advances on large closed-source models like GPT-4 present challenges in accessibility, privacy, and latency. We focus on developing small, efficient, and open-source text-to-SQL generation models.
arXiv Detail & Related papers (2024-10-16T18:03:24Z) - SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging [30.306023265985658]
We introduce a framework for generating high-quality synthetic training data for any dialect.
We propose a novel Mixture-of-Experts (MoE) that leverages the shared knowledge across dialects.
arXiv Detail & Related papers (2024-08-22T20:50:48Z) - Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces a framework for enhancing Text-to-SQL using large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we examine the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.