Fugu-MT Paper Translation (Summary): Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately

Paper overview: Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately

arxiv url: http://arxiv.org/abs/2505.13326v1
Date: Mon, 19 May 2025 16:34:56 GMT
Status: Translation complete
Last updated in system: 2025-05-20 14:57:11.739573
Title: Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
Title (reference translation): Concise and Correct Thinking: Performing LLM Reasoning Effectively and Accurately
Authors: Yuhang Wang, Youhe Jiang, Bin Cui, Fangcheng Fu
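Abstract summary: Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning to respond to a given request. However, when the two scaling dimensions are combined, system efficiency degrades significantly for two reasons. This paper presents SART, a serving framework for efficient LLM reasoning.
Reference score (independently computed attention score): 29.018731931275138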
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in test-time scaling suggest that Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning (analogous to human thinking) to respond to a given request, and that exploring more reasoning branches (i.e., generating multiple responses and ensembling them) can improve the final output quality. However, when incorporating the two scaling dimensions, we find that the system efficiency is dampened significantly for two reasons. Firstly, the time cost to generate the final output increases substantially, as many reasoning branches become trapped in the over-thinking dilemma and produce excessively long responses. Secondly, generating multiple reasoning branches for each request increases memory consumption, which is unsuitable for LLM serving since only a limited number of requests can be batched and processed simultaneously. To address this, we present SART, a serving framework for efficient and accurate LLM reasoning. The essential idea is to manage the thinking to be short and right, rather than long. For one thing, we devise a redundant sampling approach with early stopping, based on empirical observations and theoretical analysis, which increases the likelihood of obtaining short-thinking responses when sampling reasoning branches. For another, we propose to dynamically prune low-quality branches so that only right-thinking branches are maintained, reducing the memory consumption and allowing us to batch more requests. Experimental results demonstrate that SART not only improves the accuracy of LLM reasoning but also enhances the serving efficiency, outperforming existing methods by up to 28.2 times and on average 15.7 times in terms of efficiency when achieving the same level of accuracy.
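Abstract (reference translation): Recent advances in test-time scaling suggest that Large Language Models (LLMs) can improve final output quality by generating Chain-of-Thought reasoning (analogous to human thinking) to respond to a given request and by exploring more reasoning branches (i.e., generating multiple responses and ensembling them). However, we find that when the two scaling dimensions are combined, system efficiency degrades significantly for two reasons. First, the time cost of generating the final output grows substantially, because many reasoning branches become trapped in the over-thinking dilemma and produce excessively long responses. Second, generating multiple reasoning branches for each request increases memory consumption. This paper therefore proposes SART, a serving framework for efficient LLM reasoning. The basic idea is to manage thinking so that it is short and right rather than long. On the one hand, we devise redundant sampling with early stopping, based on empirical observation and theoretical analysis, which raises the likelihood of obtaining short-thinking responses when sampling reasoning branches. On the other hand, we dynamically prune low-quality branches so that only right-thinking branches are maintained, reducing memory consumption and allowing more requests to be batched. Experimental results show that SART not only improves the accuracy of LLM reasoning but also outperforms existing methods in efficiency by up to 28.2 times and by 15.7 times on average.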
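
Illustrative sketch (not from the paper): the abstract does not spell out how redundant sampling with early stopping works, so the Python below is only one plausible reading under stated assumptions. It assumes all branches decode in lockstep, so a branch's token count decides when it finishes, and it assumes that "ensembling" means a majority vote over final answers. The function name `early_stop_vote`, the parameter `m`, and the sample data are all hypothetical. The point it illustrates is that waiting for the first `m` finishers out of a larger pool tends to cut latency, because long over-thinking branches are never waited on.

```python
from collections import Counter


def early_stop_vote(branches, m):
    """Keep the m branches that finish first, then majority-vote their answers.

    branches: list of (token_count, answer) pairs, one per sampled branch.
    Assumption: lockstep decoding, so a shorter branch finishes earlier.
    Returns (voted_answer, latency), where latency is the token count of
    the slowest branch that was actually waited for.
    """
    if not 1 <= m <= len(branches):
        raise ValueError("m must be between 1 and the number of branches")
    # The first m finishers are the m shortest branches.
    finished = sorted(branches, key=lambda b: b[0])[:m]
    latency = finished[-1][0]
    # Majority vote is an assumed ensembling rule, not the paper's stated one.
    votes = Counter(answer for _, answer in finished)
    return votes.most_common(1)[0][0], latency


# Hypothetical sample: six branches with made-up lengths and answers.
branches = [(900, "42"), (120, "42"), (300, "41"),
            (150, "42"), (2000, "7"), (200, "42")]

# Sample exactly 3 branches and wait for all of them.
baseline = early_stop_vote(branches[:3], m=3)
# Sample 6 branches but stop once the first 3 have finished.
redundant = early_stop_vote(branches, m=3)
```

With this made-up data, both calls vote for the same answer, but the redundant variant waits for 200 tokens instead of 900. The extra branches cost memory while they run, which is the pressure the abstract's second idea, pruning low-quality branches, is meant to relieve.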
