Fugu-MT Paper Translation (Abstract): Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Paper overview: Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

arxiv url: http://arxiv.org/abs/2510.10885v1
Date: Mon, 13 Oct 2025 01:29:54 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.139144
Title: Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
Title (reference translation): Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks
Authors: Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren
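Abstract summary: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems. Test-time scaling strategies have shown promise in LLM-based solutions, but their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
Reference score (independently computed attention score): 21.891522433628893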
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
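Abstract (reference translation): Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems. Test-time scaling strategies have shown promise in LLM-based solutions, but their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights for practical system deployment. Our results show that Divide-and-Conquer prompting and few-shot demonstrations consistently improve performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.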
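Note: The abstract reports accuracy together with inference latency and token consumption. The following is a minimal sketch of how those three quantities could be collected per question, not the authors' evaluation harness. It assumes accuracy means execution match (the predicted and gold queries return the same set of rows) on a SQLite database. The names `RunRecord`, `execution_match`, `evaluate`, `summarize`, and `stub_generate_sql` are hypothetical, as are the `users` table and the fixed token count of 42; the stub stands in for a real LLM call.

```python
import sqlite3
import time
from dataclasses import dataclass


@dataclass
class RunRecord:
    correct: bool      # did the predicted SQL return the same rows as the gold SQL?
    latency_s: float   # wall-clock time spent generating the SQL
    tokens: int        # tokens reported by the (hypothetical) generator


def execution_match(conn, predicted_sql, gold_sql):
    """Compare result sets; a prediction that fails to execute counts as wrong."""
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False
    gold = set(conn.execute(gold_sql).fetchall())
    return predicted == gold


def evaluate(conn, examples, generate_sql):
    """Run generate_sql(question) -> (sql, tokens) on each example and record metrics."""
    records = []
    for question, gold_sql in examples:
        start = time.perf_counter()
        predicted_sql, tokens = generate_sql(question)
        latency = time.perf_counter() - start
        correct = execution_match(conn, predicted_sql, gold_sql)
        records.append(RunRecord(correct, latency, tokens))
    return records


def summarize(records):
    """Average the per-question metrics over the whole run."""
    n = len(records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
        "mean_tokens": sum(r.tokens for r in records) / n,
    }


# Toy database and a stub generator standing in for an LLM call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])

examples = [
    ("How many users are there?", "SELECT COUNT(*) FROM users"),
    ("List all user names.", "SELECT name FROM users"),
]

canned = {
    "How many users are there?": "SELECT COUNT(id) FROM users",  # equivalent query
    "List all user names.": "SELECT nme FROM users",             # misspelled column
}


def stub_generate_sql(question):
    return canned[question], 42  # 42 is a made-up token count


summary = summarize(evaluate(conn, examples, stub_generate_sql))
print(summary)
```

Swapping `stub_generate_sql` for different prompting strategies (for example, a few-shot prompt versus a multi-step workflow) while keeping `evaluate` fixed is one way to compare them on the same accuracy, latency, and token axes.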

Related papers