SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
- URL: http://arxiv.org/abs/2602.22223v1
- Date: Tue, 16 Dec 2025 09:15:10 GMT
- Title: SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
- Authors: Cornelius Wolff, Daniel Gomm, Madelon Hulsebos
- Abstract summary: A key bottleneck for developing text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. SQaLe captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity.
- Score: 2.905751301655124
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data scaling and model generalization in text-to-SQL research. The dataset is accessible at: https://huggingface.co/datasets/trl-lab/SQaLe-text-to-SQL-dataset.
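The abstract describes (question, schema, query) triples that maintain execution validity. As a minimal sketch of what such a check could look like, the snippet below builds an empty in-memory database from a schema and tries to run the query against it. The SQLite dialect, the `is_executable` helper, and the example triple are all assumptions made for illustration; they are not taken from the SQaLe pipeline or the dataset itself.

```python
import sqlite3


def is_executable(schema_ddl: str, query: str) -> bool:
    """Return True if `query` runs without error on an empty
    in-memory SQLite database created from `schema_ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # create the tables
        conn.execute(query).fetchall()   # run the candidate query
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical triple, invented for this example (not from the dataset).
triple = {
    "question": "How many orders did each customer place?",
    "schema": """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            total REAL
        );
    """,
    "query": (
        "SELECT c.name, COUNT(o.id) FROM customers c "
        "LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
    ),
}

valid = is_executable(triple["schema"], triple["query"])
invalid = is_executable(triple["schema"], "SELECT missing_col FROM customers")
```

A check of this kind only confirms that the query parses and references existing tables and columns; it says nothing about whether the query actually answers the question.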
Related papers
- RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models [1.0062127381149395]
RingSQL is a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. We find that models trained with RingSQL data achieve an average accuracy gain of +2.3% across six text-to-SQL benchmarks compared with models trained on other synthetic data.
arXiv Detail & Related papers (2026-01-09T00:46:53Z)
- EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis [25.689983072200047]
EvolSQL is a structure-aware data synthesis framework that evolves queries into richer and more semantically diverse forms. A 7B model trained on EvolSQL data outperforms one trained on the much larger SynSQL dataset while using only 1/18 of the data.
arXiv Detail & Related papers (2026-01-08T12:19:50Z)
- Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z)
- UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification [50.59009084277447]
We introduce UNJOIN, a framework that decouples the retrieval of schema elements from logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. In the second stage, the query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic.
arXiv Detail & Related papers (2025-05-23T17:28:43Z)
- OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale [31.852909145101677]
We propose a novel and scalable text-to-SQL data framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. We introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. We develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B.
arXiv Detail & Related papers (2025-03-04T03:30:56Z)
- Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation [26.834687657847454]
Text-to-SQL models are increasingly adopted in real-world applications. Deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios.
arXiv Detail & Related papers (2025-02-21T22:32:35Z)
- SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces SQL-PaLM, a framework for enhancing Text-to-SQL using large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we examine the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- Prompting GPT-3.5 for Text-to-SQL with De-semanticization and Skeleton Retrieval [17.747079214502673]
Text-to-SQL is a task that converts a natural language question into a structured query language (SQL) query to retrieve information from a database.
In this paper, we propose an LLM-based framework for Text-to-SQL that retrieves helpful demonstration examples to prompt LLMs.
We design a de-semanticization mechanism that extracts question skeletons, allowing us to retrieve similar examples based on their structural similarity.
arXiv Detail & Related papers (2023-04-26T06:02:01Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) query based on the evidence provided by databases.
Deep neural networks have significantly advanced this task through neural generation models, which automatically learn a mapping function from an input NL question to an output query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z)
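The UNJOIN entry above describes merging the columns of all tables into a single-table representation by prefixing each column with its table name. A minimal sketch of that first stage is shown below. The function names, the dictionary-based schema format, and the `.` separator are hypothetical choices made here, not details from the UNJOIN paper.

```python
def flatten_schema(schema: dict[str, list[str]], sep: str = ".") -> list[str]:
    """Merge all tables into one flat column list, prefixing each
    column with its table name (separator is an assumption)."""
    return [f"{table}{sep}{col}" for table, cols in schema.items() for col in cols]


def split_column(flat_name: str, sep: str = ".") -> tuple[str, str]:
    """Recover (table, column) from a prefixed column name."""
    table, col = flat_name.split(sep, 1)
    return table, col


# Hypothetical two-table schema, invented for this example.
schema = {
    "customers": ["id", "name"],
    "orders": ["id", "customer_id", "total"],
}

flat = flatten_schema(schema)
origin = split_column(flat[3])
```

Keeping the table name as a prefix is what makes the second stage possible: each column referenced in a query over the flat schema can be traced back to its source table before JOINs are reconstructed.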
This list is automatically generated from the titles and abstracts of the papers on this site.