Fugu-MT Paper Translation (Abstract): Exploring Efficiency Frontiers of Thinking Budget in Medical Reasoning: Scaling Laws between Computational Resources and Reasoning Quality

Paper summary: Exploring Efficiency Frontiers of Thinking Budget in Medical Reasoning: Scaling Laws between Computational Resources and Reasoning Quality

arxiv url: http://arxiv.org/abs/2508.12140v1
Date: Sat, 16 Aug 2025 19:25:06 GMT
Status: Translation complete
System update date: 2025-08-19 14:49:10.58492
Title: Exploring Efficiency Frontiers of Thinking Budget in Medical Reasoning: Scaling Laws between Computational Resources and Reasoning Quality
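Title (reference translation): Exploring the Efficiency Frontiers of Thinking Budgets in Medical Reasoning: Scaling Laws between Computational Resources and Reasoning Quality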
Authors: Ziqian Bi, Lu Chen, Junhao Song, Hongying Luo, Enze Ge, Junmin Huang, Tianyang Wang, Keyu Chen, Chia Xin Liang, Zihan Wei, Huafeng Liu, Chunjie Tian, Jibin Guan, Joe Yeong, Yongzhi Xu, Peng Wang, Junfeng Hao
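Abstract summary: This study is a comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks. Two major model families, Qwen3 and DeepSeek-R1, were evaluated on 15 medical datasets spanning diverse specialties and difficulty levels.
Reference score (independently computed attention score): 11.743970673134573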
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study presents the first comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks, revealing fundamental scaling laws between computational resources and reasoning quality. We systematically evaluated two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), across 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, we establish logarithmic scaling relationships where accuracy improvements follow a predictable pattern with both thinking budget and model size. Our findings identify three distinct efficiency regimes: high-efficiency (0 to 256 tokens) suitable for real-time applications, balanced (256 to 512 tokens) offering optimal cost-performance tradeoffs for routine clinical support, and high-accuracy (above 512 tokens) justified only for critical diagnostic tasks. Notably, smaller models demonstrate disproportionately larger benefits from extended thinking, with 15 to 20% improvements compared to 5 to 10% for larger models, suggesting a complementary relationship where thinking budget provides greater relative benefits for capacity-constrained models. Domain-specific patterns emerge clearly, with neurology and gastroenterology requiring significantly deeper reasoning processes than cardiovascular or respiratory medicine. The consistency between Qwen3 native thinking budget API and our proposed truncation method for DeepSeek-R1 validates the generalizability of thinking budget concepts across architectures. These results establish thinking budget control as a critical mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.
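Abstract (reference translation): This study presents a comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks and reveals fundamental scaling laws between computational resources and reasoning quality. Two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), were systematically evaluated on 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, the study establishes logarithmic scaling relationships in which accuracy improvements follow a predictable pattern with both thinking budget and model size. It identifies three distinct efficiency regimes: high-efficiency (0 to 256 tokens), suited to real-time applications; balanced (256 to 512 tokens), offering the best cost-performance tradeoff for routine clinical support; and high-accuracy (above 512 tokens), justified only for critical diagnostic tasks. Notably, smaller models improve by 15 to 20% with extended thinking, compared with 5 to 10% for larger models, suggesting a complementary relationship in which the thinking budget brings greater relative benefit to capacity-constrained models. Domain-specific patterns emerge clearly: neurology and gastroenterology require far deeper reasoning than cardiovascular or respiratory medicine. The consistency between the native Qwen3 thinking budget API and the truncation method proposed for DeepSeek-R1 validates the generalizability of the thinking budget concept across architectures. These results establish thinking budget control as a key mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.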
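
The three efficiency regimes named in the abstract can be illustrated with a small lookup. This is only a minimal sketch, not code from the paper: the names `Regime` and `classify_budget` are hypothetical, `None` is assumed to stand for an unlimited budget, and because the abstract does not say which regime the boundary values 256 and 512 belong to, the sketch assumes they fall in the lower regime.

```python
from enum import Enum
from typing import Optional


class Regime(Enum):
    """The three efficiency regimes named in the abstract."""
    HIGH_EFFICIENCY = "high-efficiency"  # 0 to 256 tokens, real-time use
    BALANCED = "balanced"                # 256 to 512 tokens, routine clinical support
    HIGH_ACCURACY = "high-accuracy"      # above 512 tokens, critical diagnostic tasks


def classify_budget(tokens: Optional[int]) -> Regime:
    """Map a thinking budget (in tokens) to a regime.

    Assumptions of this sketch, not stated in the abstract:
    - `None` represents an unlimited budget.
    - The boundary values 256 and 512 belong to the lower regime.
    """
    if tokens is None:
        return Regime.HIGH_ACCURACY
    if tokens < 0:
        raise ValueError("thinking budget must be non-negative")
    if tokens <= 256:
        return Regime.HIGH_EFFICIENCY
    if tokens <= 512:
        return Regime.BALANCED
    return Regime.HIGH_ACCURACY
```

Under these assumptions, `classify_budget(300)` returns `Regime.BALANCED`, and an unlimited budget (`None`) is treated as high-accuracy.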

Related papers