Fugu-MT paper translation (abstract): GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models

Paper summary: GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models

arxiv url: http://arxiv.org/abs/2509.01308v1
Date: Mon, 01 Sep 2025 09:47:35 GMT
Status: translation complete
Last updated in system: 2025-09-04 15:17:03.626609
Title: GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models
Title (reference translation): GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models
Authors: Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia
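Abstract summary: Outcome Reward Models (ORMs) assign utility scores to generated outputs based on semantic correctness. The paper evaluates ORMs as a heuristic for Best-of-N (BoN) selection and compares them with execution-based BoN and Majority Voting (Maj). It also introduces a framework for training ORMs for the Text-to-SQL task.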
Reference score (independently computed attention score): 16.184651199160882
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries that require precise alignment between user intent and the database schema. To mitigate this, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can generate correct answers but may require multiple attempts. However, these methods rely on surface-level heuristics, selecting either the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated query with Maj. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising approach for better aligning model predictions with user intent. Nevertheless, their application to Text-to-SQL remains largely underexplored. In this work, we evaluate ORMs as an effective heuristic for BoN, compare them with ex-BoN and Maj, and introduce a framework for training ORMs for the Text-to-SQL task. We evaluate our ORMs on the BIRD and SPIDER benchmarks, finetuning various open-source LLMs, including the Qwen2, Granite3, and Llama3 model families. Our results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that finetuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj.
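Abstract (reference translation): Text-to-SQL, the task of translating natural language questions into SQL queries, has advanced significantly with the introduction of Large Language Models (LLMs), improving database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries that require precise alignment between user intent and the database schema. To mitigate this, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often used, on the assumption that LLMs can generate correct answers but may need multiple attempts. However, these methods rely on surface-level heuristics: execution-based BoN (ex-BoN) selects a syntactically correct query, and Maj selects the most frequently generated one. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising approach for better aligning model predictions with user intent. Nevertheless, their application to Text-to-SQL remains largely unexplored. This work evaluates ORMs as an effective heuristic for BoN, compares them with ex-BoN and Maj, and introduces a framework for training ORMs for the Text-to-SQL task. The ORMs are evaluated on the BIRD and Spider benchmarks, fine-tuning various open-source LLMs, including the Qwen2, Granite3, and Llama3 model families. The results show that ORMs outperform ex-BoN and Maj, with execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. Fine-tuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. In addition, ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates than ex-BoN and Maj.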
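
To make the three selection strategies named in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes that ex-BoN returns the first candidate that executes without a database error, that Maj votes on whitespace-normalized query strings, and that the ORM can be treated as a black-box function returning a score for a (question, SQL) pair. The `singer` table, the candidate queries, and the scores in `toy_orm_scores` are all hypothetical values made up for illustration; a real ORM would be a fine-tuned LLM.

```python
import sqlite3
from collections import Counter
from typing import Callable, Optional, Sequence


def executes(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the query runs without raising a database error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def ex_bon(conn: sqlite3.Connection, candidates: Sequence[str]) -> Optional[str]:
    """Execution-based BoN (assumed form): first candidate that executes."""
    return next((sql for sql in candidates if executes(conn, sql)), None)


def majority_vote(candidates: Sequence[str]) -> Optional[str]:
    """Maj (assumed form): most frequent query after whitespace normalization."""
    if not candidates:
        return None
    normalize = lambda sql: " ".join(sql.split())
    winner = Counter(normalize(sql) for sql in candidates).most_common(1)[0][0]
    # Return the original text of the first candidate matching the winner.
    return next(sql for sql in candidates if normalize(sql) == winner)


def orm_bon(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> Optional[str]:
    """ORM-based BoN: candidate with the highest score from the reward model."""
    return max(candidates, key=lambda sql: score(question, sql), default=None)


# Hypothetical toy database and candidate pool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ana", "France"), ("Bo", "Italy"), ("Cy", "France")],
)

question = "How many singers are from France?"
candidates = [
    "SELECT COUNT(*) FROM singers WHERE country = 'France'",  # wrong table name
    "SELECT COUNT(*) FROM singer",                             # runs, ignores filter
    "SELECT COUNT(*) FROM singer",
    "SELECT COUNT(*) FROM singer WHERE country = 'France'",   # matches the intent
]

# Made-up scores standing in for a trained ORM's output.
toy_orm_scores = {
    candidates[0]: 0.10,
    candidates[1]: 0.35,
    candidates[3]: 0.90,
}


def toy_orm_score(question: str, sql: str) -> float:
    """Hypothetical stand-in for an ORM forward pass."""
    return toy_orm_scores[sql]


print("ex-BoN:", ex_bon(conn, candidates))
print("Maj:   ", majority_vote(candidates))
print("ORM:   ", orm_bon(question, candidates, toy_orm_score))
```

In this toy setup, ex-BoN and Maj both return the unfiltered count because it executes and is the most frequent candidate, while the scorer-based selection returns the filtered query. The example only illustrates how the selection rules differ; it says nothing about how often each rule is right on BIRD or Spider.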


Related papers