Fugu-MT Paper Translation (Abstract): NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

Paper overview: NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

arxiv url: http://arxiv.org/abs/2604.16493v1
Date: Mon, 13 Apr 2026 18:00:05 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.02749
Title: NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
Authors: Shizheng Hou, Wenqi Pei, Nuo Chen, Quang-Trung Ta, Peng Lu, Beng Chin Ooi,
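Abstract summary: Large language models (LLMs) have greatly improved NL2SQL algorithms, but their rapid development outpaces systematic evaluation. The paper presents NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. The evaluation reveals significant gaps in existing NL2SQL methods: substantial room for accuracy improvement and significant computational inefficiency.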
Reference score (independently computed attention score): 16.53346245559808
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Natural Language to SQL (NL2SQL) technology empowers non-expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine-grained metrics that systematically quantify module-level effectiveness and efficiency. We further implement these metrics in a flexible multi-agent framework, allowing configurable benchmarking across diverse NL2SQL approaches. Leveraging NL2SQLBench, we rigorously evaluate ten representative open-source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, using two LLMs, DeepSeek-V3 and GPT-4o mini. We systematically assess each approach across the three core modules and evaluate multiple critical performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods, highlighting not only substantial room for accuracy improvements but also significant computational inefficiency, which severely hampers real-world adoption. Furthermore, our analysis identifies critical shortcomings in current benchmark datasets and evaluation rules, such as inaccurate gold SQL annotations and limitations in the rules used to judge correctness. By synthesizing these insights into a unified benchmark, our study establishes a clear reference point for fair comparison and serves as essential guidance for future targeted innovation in NL2SQL technology.

Related papers