Fugu-MT Paper Translation (Abstract): SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

Paper overview: SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

arxiv url: http://arxiv.org/abs/2604.17771v1
Date: Mon, 20 Apr 2026 03:50:21 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.682208
Title: SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
Authors: Mohammadtaher Safarzadeh, Hitesh Laxmichand Patel, Afshin Orojlooyjadid, Graham Horwood, Dan Roth
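Abstract summary: Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks. The reported accuracy may be inflated by contamination from benchmark queries or by structurally similar patterns seen during training. This work introduces SPENCE, a controlled syntactic probing framework for detecting and quantifying such contamination.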
Reference score (independently computed attention score): 40.31493151791439
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during training. We introduce SPENCE (Syntactic Probing and Evaluation of NL2SQL Contamination Effects), a controlled syntactic probing framework for detecting and quantifying such contamination. SPENCE systematically generates syntactic variants of test queries for four widely used NL2SQL datasets: Spider, SParC, CoSQL, and the newer BIRD benchmark. We use SPENCE to evaluate multiple high-capacity LLMs under execution-based scoring. For each model, we measure changes in execution accuracy across increasing levels of syntactic divergence and quantify rank sensitivity using Kendall's tau with bootstrap confidence intervals. By aligning these robustness trends with benchmark release dates, we observe a clear temporal gradient: older benchmarks such as Spider exhibit the strongest negative values and thus the highest likelihood of training leakage, whereas the more recent BIRD dataset shows minimal sensitivity and appears largely uncontaminated. Together, these findings highlight the importance of temporally contextualized, syntactic-probing evaluation for trustworthy NL2SQL benchmarking.
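
The rank-sensitivity measurement mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say exactly which quantities are correlated, so this sketch assumes Kendall's tau-b is computed between each variant's syntactic divergence level and its 0/1 execution correctness, with a percentile bootstrap for the confidence interval. The function names `kendall_tau_b` and `bootstrap_tau_ci`, the number of levels, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b (tie-corrected), computed with an O(n^2) pairwise count."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def bootstrap_tau_ci(levels, correct, n_boot=500, alpha=0.05, seed=0):
    """Point estimate of tau-b plus a percentile bootstrap interval."""
    rng = random.Random(seed)
    pairs = list(zip(levels, correct))
    taus = []
    for _ in range(n_boot):
        # Resample (level, correctness) pairs with replacement.
        sample = [rng.choice(pairs) for _ in pairs]
        taus.append(kendall_tau_b(*zip(*sample)))
    taus.sort()
    lower = taus[int((alpha / 2) * n_boot)]
    upper = taus[int((1 - alpha / 2) * n_boot) - 1]
    return kendall_tau_b(levels, correct), (lower, upper)


# Toy data (invented for illustration): 10 variants at each of 4 divergence
# levels, with execution correctness dropping as divergence grows.
levels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
correct = [1] * 9 + [0] * 1 + [1] * 8 + [0] * 2 + [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6

tau, (ci_low, ci_high) = bootstrap_tau_ci(levels, correct)
print(f"tau-b = {tau:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Under this assumed setup, a clearly negative tau whose interval excludes zero would correspond to the pattern the abstract associates with possible leakage (accuracy falling as variants diverge from the original queries), while a tau near zero would correspond to the low sensitivity reported for BIRD.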


Related papers