SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation
- URL: http://arxiv.org/abs/2509.25672v1
- Date: Tue, 30 Sep 2025 02:14:49 GMT
- Title: SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation
- Authors: Hasan Alp Caferoğlu, Mehmet Serhat Çelik, Özgür Ulusoy
- Abstract summary: SING-SQL is a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data. SingSQL-LM is a family of compact language models fine-tuned on the synthetic data.
- Score: 2.0799061948689306
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Translating natural language questions into SQL has become a core challenge in enabling non-technical users to query databases. While recent work has explored large-scale synthetic data generation to improve model performance through post-training, most efforts emphasize cross-domain generalization. This leaves a gap for real-world enterprise scenarios, where models need to specialize to a single database schema and organizations require to be able to evaluate their Text-to-SQL systems on their own databases. To address this, we introduce SING-SQL, a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database, without relying on SQL logs or manual annotations. Our approach hierarchically partitions a database schema into sub-schemas, synthesizes SQL queries across multiple complexity levels, and applies a quality-aware pipeline that includes LLM-as-a-judge validation, executability checks, automatic repair, and column balancing. We further release SingSQL-LM, a family of compact language models fine-tuned on the synthetic data, achieving strong in-domain generalization. On the subset of the BIRD benchmark, SingSQL-LM-3B-R64 reaches 82.87% Soft F1 and 73.03% EX upper bound with 32 candidates, outperforming the best 3B-scale baseline by +16.21 in Soft F1 and +12.36 in EX. At the 1.5B scale, SingSQL-LM-1.5B-R64 improves over prior systems by +9.30 in Soft F1 and +4.49 in EX. On synthetic evaluation sets, SingSQL-LMs exceed prior systems by wide margins, establishing state-of-the-art performance among open models at comparable scales. Our study of context management strategies reveals that schema-free fine-tuning combined with schema-only inference provides the most robust results. These findings establish SING-SQL as a scalable, database-agnostic paradigm for producing and evaluating enterprise-grade Text-to-SQL systems.
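The executability check and the "EX upper bound with 32 candidates" mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`execute`, `filter_executable`, `ex_upper_bound`), the toy `employee` table, and the reading of "upper bound" as "at least one candidate returns the same rows as the gold query" are all assumptions made for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return its rows as an order-insensitive multiset, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def filter_executable(conn, candidates):
    """Executability check (hypothetical helper): keep only queries that run without error."""
    return [sql for sql in candidates if execute(conn, sql) is not None]


def ex_upper_bound(conn, gold_sql, candidates):
    """Assumed metric: True if any candidate yields the same rows as the gold query."""
    gold = execute(conn, gold_sql)
    if gold is None:
        raise ValueError("gold query is not executable")
    return any(execute(conn, sql) == gold for sql in candidates)


# Toy database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    INSERT INTO employee VALUES (1, 'Ada', 'R&D'), (2, 'Lin', 'Ops'), (3, 'Sam', 'R&D');
    """
)

gold = "SELECT name FROM employee WHERE dept = 'R&D'"
candidates = [
    "SELECT name FROM employe WHERE dept = 'R&D'",  # fails: misspelled table name
    "SELECT name FROM employee WHERE dept = 'Ops'",  # runs, but returns different rows
    "SELECT name FROM employee WHERE dept = 'R&D' ORDER BY name DESC",  # runs, same rows
]

runnable = filter_executable(conn, candidates)
hit = ex_upper_bound(conn, gold, candidates)
```

Comparing results as multisets ignores row order, which is a simplification; a stricter comparison would be needed for queries where `ORDER BY` is part of the question.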
Related papers
- RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models [1.0062127381149395]
RingSQL is a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions.
We find that models trained with RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data.
arXiv Detail & Related papers (2026-01-09T00:46:53Z)
- Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state.
Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z) - LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction [5.123751486259634]
We introduce LitE-, a Lightweight and Efficient framework with two components.<n>On BIRD, LitE- achieves 72.10% execution accuracy, and on Spider it reaches 88.45%, demonstrating comparable or superior performance to Retriever.<n>Our findings demonstrate that high-quality Text-to-correction generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
arXiv Detail & Related papers (2025-10-10T05:27:47Z) - DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL [18.915121803834698]
We propose DB-Explore, a novel framework that systematically aligns large language models with database knowledge.<n>Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation.
arXiv Detail & Related papers (2025-03-06T20:46:43Z) - OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale [31.852909145101677]
We propose a novel and scalable text-to-data framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention.<n>We introduce Syn-2.5M, the first million-scale text-to-dataset, containing 2.5 million samples spanning over 16,000 synthetic databases.<n>We develop Omni, a powerful open-source text-to-model available in three sizes: 7B, 14B, and 32B.
arXiv Detail & Related papers (2025-03-04T03:30:56Z) - RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL- that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on GPT-4ocorrection.
Our approach outperforms a series of GPT-4 based Text-to-Seek systems when adopting DeepSeek (much cheaper) with same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z) - Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-s enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposed a novel approach that only needs SQL Quality to enhance Text-to-s performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z) - Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to- tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z) - MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL [47.120862170230566]
Recent Text-to-yourself methods usually suffer from significant performance degradation on "huge" databases.<n>We introduce MAC, a novel Text-to-yourself LLM-based multi-agent collaborative framework.<n>In our framework, we leverage GPT-4 as the strong backbone for all agent tasks to determine the upper bound of our framework.<n>We then fine-tune an open-sourced instruction-followed model,sql-Llama, by leveraging Code 7B, to accomplish all tasks as GPT-4 does.
arXiv Detail & Related papers (2023-12-18T14:40:20Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs)
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems.
It is composed of publicly available text-to-domain datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)