Fugu-MT Paper Translation (Summary): ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Paper summary: ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

arxiv url: http://arxiv.org/abs/2512.07843v1
Date: Mon, 24 Nov 2025 18:55:59 GMT
Status: Translation complete
Last updated in system: 2025-12-15 04:16:52.506021
Title: ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
Title (reference translation): ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
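Abstract summary: We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. ThreadWeaver delivers up to 1.53x average speedup in token latency.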
Reference score (independently computed attention score): 99.6720868215076
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
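Abstract (reference translation): Scaling inference-time computation has allowed Large Language Models (LLMs) to reach strong reasoning performance, but inherently sequential decoding causes considerable latency, particularly on complex tasks. Recent research on adaptive parallel reasoning aims to improve inference efficiency by splitting the problem-solving process into parallel reasoning threads when doing so is beneficial. However, existing methods for realistic tasks are either restricted to supervised behavior cloning or show a considerable drop in accuracy compared to widely used sequential long chain-of-thought (CoT) baselines. In addition, many require customized inference engines, which complicates deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency.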
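To make the notion of a token-latency speedup concrete, the following is a minimal Python sketch of one common way to reason about fork-join decoding: sequential cost is the total number of tokens, while parallel cost counts only the longest thread between the fork and the join. The `ForkJoinTrace` type, its fields, and the token counts are hypothetical illustrations, not taken from the paper, and this is not necessarily how the authors measure latency.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ForkJoinTrace:
    """Hypothetical token counts for one fork-join reasoning trace."""
    prefix_tokens: int        # sequential tokens decoded before the fork
    thread_tokens: List[int]  # tokens decoded by each parallel thread
    suffix_tokens: int        # sequential tokens decoded after the join


def sequential_latency(trace: ForkJoinTrace) -> int:
    # Decoding one token at a time costs the sum of every token produced.
    return trace.prefix_tokens + sum(trace.thread_tokens) + trace.suffix_tokens


def parallel_latency(trace: ForkJoinTrace) -> int:
    # Assumption: threads decode concurrently, so only the longest thread
    # lies on the critical path between the fork and the join.
    longest = max(trace.thread_tokens, default=0)
    return trace.prefix_tokens + longest + trace.suffix_tokens


def speedup(trace: ForkJoinTrace) -> float:
    # Ratio of sequential to critical-path token counts.
    return sequential_latency(trace) / parallel_latency(trace)


# Made-up numbers chosen only to illustrate the arithmetic.
trace = ForkJoinTrace(prefix_tokens=200, thread_tokens=[300, 250, 150], suffix_tokens=100)
print(sequential_latency(trace))  # 1000
print(parallel_latency(trace))    # 600
print(round(speedup(trace), 2))   # 1.67
```

Under this simplified model the speedup grows with the share of tokens that can be moved off the critical path, which is why a trace with short or unbalanced threads gains little from parallelization.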


Related papers