Fugu-MT Paper Translation (Abstract): SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

Paper abstract: SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

arxiv url: http://arxiv.org/abs/2509.25672v1
Date: Tue, 30 Sep 2025 02:14:49 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.394145
Title: SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation
Title (reference translation): SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation
Authors: Hasan Alp Caferoğlu, Mehmet Serhat Çelik, Özgür Ulusoy
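Abstract summary: SING-SQL is a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data. SingSQL-LM is a family of compact language models fine-tuned on that synthetic data.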
Reference score (attention score computed independently by this site): 2.0799061948689306
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Translating natural language questions into SQL has become a core challenge in enabling non-technical users to query databases. While recent work has explored large-scale synthetic data generation to improve model performance through post-training, most efforts emphasize cross-domain generalization. This leaves a gap for real-world enterprise scenarios, where models need to specialize to a single database schema and organizations need to evaluate their Text-to-SQL systems on their own databases. To address this, we introduce SING-SQL, a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database, without relying on SQL logs or manual annotations. Our approach hierarchically partitions a database schema into sub-schemas, synthesizes SQL queries across multiple complexity levels, and applies a quality-aware pipeline that includes LLM-as-a-judge validation, executability checks, automatic repair, and column balancing. We further release SingSQL-LM, a family of compact language models fine-tuned on the synthetic data, achieving strong in-domain generalization. On the subset of the BIRD benchmark, SingSQL-LM-3B-R64 reaches 82.87% Soft F1 and 73.03% EX upper bound with 32 candidates, outperforming the best 3B-scale baseline by +16.21 in Soft F1 and +12.36 in EX. At the 1.5B scale, SingSQL-LM-1.5B-R64 improves over prior systems by +9.30 in Soft F1 and +4.49 in EX. On synthetic evaluation sets, SingSQL-LMs exceed prior systems by wide margins, establishing state-of-the-art performance among open models at comparable scales. Our study of context management strategies reveals that schema-free fine-tuning combined with schema-only inference provides the most robust results. These findings establish SING-SQL as a scalable, database-agnostic paradigm for producing and evaluating enterprise-grade Text-to-SQL systems.
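Abstract (reference translation): Translating natural language questions into SQL has become a major challenge in letting non-technical users query databases. Recent work has explored large-scale synthetic data generation to improve model performance through post-training, but most of it emphasizes cross-domain generalization. That leaves real-world enterprise scenarios uncovered: there, a model must specialize to a single database schema, and an organization must be able to evaluate its Text-to-SQL system on its own database. To address this, the authors introduce SING-SQL, a fully automated two-stage framework that generates high-quality, high-coverage synthetic Text-to-SQL data for any target database without relying on SQL logs or manual annotations. The approach hierarchically partitions the database schema into sub-schemas, synthesizes SQL queries at multiple complexity levels, and applies a quality-aware pipeline comprising LLM-as-a-judge validation, executability checks, automatic repair, and column balancing. The authors also release SingSQL-LM, a family of compact language models fine-tuned on the synthetic data, which achieves strong in-domain generalization. On a subset of the BIRD benchmark, SingSQL-LM-3B-R64 reaches 82.87% Soft F1 and a 73.03% EX upper bound with 32 candidates, outperforming the best 3B-scale baseline by +16.21 in Soft F1 and +12.36 in EX. At the 1.5B scale, SingSQL-LM-1.5B-R64 improves over prior systems by +9.30 in Soft F1 and +4.49 in EX. On synthetic evaluation sets, the SingSQL-LM models exceed prior systems by wide margins, establishing state-of-the-art performance among open models of comparable scale. The study of context management strategies shows that schema-free fine-tuning combined with schema-only inference gives the most robust results. These findings establish SING-SQL as a scalable, database-agnostic paradigm for producing and evaluating enterprise-grade Text-to-SQL systems.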
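
The abstract mentions executability checks and an "EX upper bound with 32 candidates" without defining them on this page. The following is a minimal Python sketch of how such quantities are commonly computed, not the authors' implementation. It assumes that a candidate is "executable" if SQLite runs it without error, that results are compared as order-insensitive multisets of rows, and that the EX upper bound is the fraction of questions for which at least one candidate reproduces the gold query's result. The helper names `run_query` and `ex_upper_bound`, and the toy `singer` table, are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query does not execute."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # executability check failed


def ex_upper_bound(conn, examples):
    """Fraction of examples where any candidate matches the gold result (assumed definition)."""
    hits = 0
    for gold_sql, candidates in examples:
        gold = run_query(conn, gold_sql)
        results = (run_query(conn, c) for c in candidates)
        # Non-executable candidates yield None and can never count as a hit.
        if gold is not None and any(r is not None and r == gold for r in results):
            hits += 1
    return hits / len(examples)


# Toy database standing in for a target schema (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 30), (2, 'Ben', 41), (3, 'Cy', 25);
""")

examples = [
    # One candidate fails to execute, the other returns the gold rows.
    ("SELECT name FROM singer WHERE age > 28",
     ["SELECT name FROM singers WHERE age > 28",
      "SELECT name FROM singer WHERE age >= 30"]),
    # The only candidate executes but returns a different result.
    ("SELECT COUNT(*) FROM singer",
     ["SELECT COUNT(*) FROM singer WHERE age > 100"]),
]

score = ex_upper_bound(conn, examples)
print(f"EX upper bound: {score:.2f}")
```

With more candidates per question this score can only stay the same or rise, which is why it is read as an upper bound on what a perfect candidate selector could achieve.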
