Fugu-MT Paper Translation (Abstract): GradeSQL: Test-Time Inference with Outcome Reward Models for Text-to-SQL Generation from Large Language Models

Paper overview: GradeSQL: Test-Time Inference with Outcome Reward Models for Text-to-SQL Generation from Large Language Models

arxiv url: http://arxiv.org/abs/2509.01308v2
Date: Wed, 29 Oct 2025 14:09:33 GMT
Status: Translation complete
Last updated in system: 2025-10-30 15:50:44.080466
Title: GradeSQL: Test-Time Inference with Outcome Reward Models for Text-to-SQL Generation from Large Language Models
Title (reference translation): GradeSQL: Test-Time Inference with Outcome Reward Models for Generating SQL from Text with Large Language Models
Authors: Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia
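Abstract summary: Outcome Reward Models (ORMs) assign utility scores to generated outputs based on semantic correctness. We propose a unified framework for training ORMs tailored to the Text-to-SQL task.
Reference score (independently computed attention score): 16.184651199160882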
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can produce correct answers after multiple attempts. However, these methods rely on surface-level heuristics, selecting the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated one through Majority Voting. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could serve as an effective new test-time heuristic, although their application in this context remains largely underexplored. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and assess their effectiveness as a test-time heuristic within the BoN strategy. We benchmark ORMs against ex-BoN and Maj across the BIRD and Spider datasets, fine-tuning diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families. Results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that finetuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj.
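Abstract (reference translation): Text-to-SQL, the task of translating natural language questions into SQL queries, has advanced considerably with the introduction of Large Language Models (LLMs), improving database accessibility for a broad range of users. Although the generation of valid SQL has progressed substantially, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often used, on the assumption that an LLM can reach a correct answer after multiple attempts. However, these methods rely on surface-level heuristics: execution-based BoN (ex-BoN) selects a syntactically correct query, and Majority Voting selects the most frequently generated one. Recently, Outcome Reward Models (ORMs), which assign utility scores to outputs based on semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could serve as an effective new test-time heuristic. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and evaluate their effectiveness as a test-time heuristic within the BoN strategy. We fine-tune diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families and benchmark ORMs against ex-BoN and Maj on the BIRD and Spider datasets. ORMs outperform both, improving execution accuracy by +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and by +2.91% (BIRD) and +0.93% (Spider) over Maj. Furthermore, ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates than ex-BoN and Maj.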
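
The three test-time selection heuristics compared in the abstract (ex-BoN, Maj, and ORM-scored BoN) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the SQLite toy table, the choice to vote over execution results rather than over query strings, and the hard-coded scores standing in for a trained ORM are all assumptions made for the example.

```python
import sqlite3
from collections import Counter
from typing import Callable, Optional, Sequence


def run_query(conn: sqlite3.Connection, sql: str) -> Optional[tuple]:
    """Execute sql; return an order-insensitive result key, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(map(repr, rows)))


def ex_bon(conn: sqlite3.Connection, candidates: Sequence[str]) -> Optional[str]:
    """Execution-based BoN: first candidate that runs without an error."""
    for sql in candidates:
        if run_query(conn, sql) is not None:
            return sql
    return None


def majority_vote(conn: sqlite3.Connection, candidates: Sequence[str]) -> Optional[str]:
    """Maj (assumed variant): pick a query whose execution result is most common."""
    executed = [(sql, run_query(conn, sql)) for sql in candidates]
    valid = [(sql, res) for sql, res in executed if res is not None]
    if not valid:
        return None
    top_result, _ = Counter(res for _, res in valid).most_common(1)[0]
    return next(sql for sql, res in valid if res == top_result)


def orm_bon(question: str, candidates: Sequence[str],
            score: Callable[[str, str], float]) -> Optional[str]:
    """ORM-based BoN: pick the candidate with the highest reward score."""
    if not candidates:
        return None
    return max(candidates, key=lambda sql: score(question, sql))


# Toy database and candidate pool (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 30), ("Bob", 45), ("Cy", 45)])

question = "How many singers are older than 40?"
candidates = [
    "SELECT COUNT(*) FROM singers",                 # fails: no such table
    "SELECT COUNT(*) FROM singer",                  # runs, wrong answer (3)
    "SELECT COUNT(*) FROM singer WHERE age >= 30",  # runs, wrong answer (3)
    "SELECT COUNT(*) FROM singer WHERE age > 40",   # runs, correct answer (2)
]

# Hard-coded scores standing in for a trained ORM (purely illustrative).
hypothetical_scores = dict(zip(candidates, [0.05, 0.30, 0.25, 0.90]))


def toy_score(question: str, sql: str) -> float:
    return hypothetical_scores[sql]


ex_bon_choice = ex_bon(conn, candidates)
maj_choice = majority_vote(conn, candidates)
orm_choice = orm_bon(question, candidates, toy_score)
```

In this toy pool, ex-BoN and Maj both settle on a query that executes but answers the wrong question, while the score-based selector can pick the semantically correct one, provided the scorer ranks it highest. With a real ORM the scores would come from a fine-tuned model rather than a lookup table.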

Related papers