Fugu-MT Paper Translation (Abstract): Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration

Paper overview: Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration

arxiv url: http://arxiv.org/abs/2510.26495v1
Date: Thu, 30 Oct 2025 13:44:22 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.8411
Title: Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Title (reference translation): Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Authors: Linzhuang Sun, Tianyu Guo, Hao Liang, Yuying Li, Qifeng Cai, Jingxuan Wei, Bihui Yu, Wentao Zhang, Bin Cui
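Abstract summary: We introduce DySQL-Bench, a benchmark that assesses model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks.
Reference score (attention score computed independently by this site): 21.94739453628141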
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench .
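Abstract (reference translation): Recent advances in Text-to-SQL have produced strong results on static, single-turn tasks, in which a model generates a SQL query from a natural language question. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark that assesses model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, which is followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We also propose a multi-turn evaluation framework that simulates realistic interactions among an LLM-simulated user, the model under test, and an executable database. As user intents change, the model must adapt its reasoning and SQL generation. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench .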
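The three-party interaction described in the abstract (a simulated user, the model under test, and an executable database) can be pictured with a minimal sketch. This is not the DySQL-Bench implementation: the `orders` schema, the scripted user turns, and the rule-based `toy_model` are all hypothetical stand-ins chosen so that the loop runs without an LLM.

```python
import sqlite3
from typing import Callable, List, Tuple

Rows = List[Tuple]


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EU", 120.0), (2, "EU", 80.0), (3, "US", 200.0)],
    )
    return conn


def toy_model(history: List[str]) -> str:
    """Hypothetical stand-in for the model under test: latest user turn -> SQL."""
    latest = history[-1].lower()
    if "by region" in latest:
        return (
            "SELECT region, SUM(amount) FROM orders "
            "GROUP BY region ORDER BY region"
        )
    return "SELECT SUM(amount) FROM orders"


def run_dialogue(
    conn: sqlite3.Connection,
    user_turns: List[str],
    model: Callable[[List[str]], str],
) -> List[Rows]:
    """Alternate user turns and model SQL, executing each query on the database."""
    history: List[str] = []
    results: List[Rows] = []
    for turn in user_turns:  # scripted turns stand in for an LLM-simulated user
        history.append(turn)
        sql = model(history)
        rows = conn.execute(sql).fetchall()
        history.append(f"result: {rows}")  # feed the result back as context
        results.append(rows)
    return results


if __name__ == "__main__":
    turns = [
        "What is the total order amount?",
        "Now break that down by region.",  # the user's intent evolves here
    ]
    for rows in run_dialogue(make_db(), turns, toy_model):
        print(rows)
```

In a real harness the scripted turns would come from an LLM playing the user and `toy_model` would be replaced by the system being evaluated; how DySQL-Bench scores a dialogue is defined in the paper and its repository, not here.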

Related papers