Fugu-MT Paper Translation (Abstract): Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization

Paper overview: Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization

arxiv url: http://arxiv.org/abs/2510.04182v1
Date: Sun, 05 Oct 2025 12:50:39 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.50342
Title: Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
Authors: Wengao Ye, Yan Liang, Lianlei Shan
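Abstract summary: Latent Thought Policy Optimization (LTPO) enhances LLM reasoning entirely at test time. Experiments show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on the highly challenging AIME benchmarks, where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements.
Reference score (independently computed attention score): 5.674809920704963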
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
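The optimization loop described in the abstract can be sketched roughly as follows. This is a minimal toy illustration under stated assumptions, not the authors' implementation: the frozen LLM is replaced by a fixed linear map plus softmax (`frozen_model`, `TOY_WEIGHTS`), the confidence reward is assumed to be the top-token probability, and the policy is assumed to be a Gaussian centered on the current latent thought vector, updated with a REINFORCE-style estimate. All names and hyperparameters are hypothetical.

```python
import math
import random

# Hypothetical stand-in for a frozen LLM: 4 "tokens", 3-dimensional latent thought.
TOY_WEIGHTS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frozen_model(thought, weights):
    """Map a latent thought vector to an output distribution; weights never change."""
    logits = [sum(w * t for w, t in zip(row, thought)) for row in weights]
    return softmax(logits)


def confidence_reward(probs):
    """Assumed intrinsic reward: probability of the most likely token."""
    return max(probs)


def optimize_thought(thought, weights, steps=200, samples=8,
                     sigma=0.05, lr=0.2, seed=0):
    """REINFORCE-style test-time update of the thought vector only.

    Returns the best thought seen (judged at the policy mean) and its reward.
    """
    rng = random.Random(seed)
    mean = list(thought)
    best = list(mean)
    best_reward = confidence_reward(frozen_model(mean, weights))
    for _ in range(steps):
        # Sample candidate thoughts from a Gaussian policy around the mean.
        noises = [[rng.gauss(0.0, sigma) for _ in mean] for _ in range(samples)]
        rewards = [
            confidence_reward(
                frozen_model([m + n for m, n in zip(mean, noise)], weights))
            for noise in noises
        ]
        baseline = sum(rewards) / samples  # variance-reducing baseline
        # Score-function gradient w.r.t. the mean: (R - b) * noise / sigma^2.
        grad = [0.0] * len(mean)
        for noise, reward in zip(noises, rewards):
            for i, n in enumerate(noise):
                grad[i] += (reward - baseline) * n / (sigma ** 2)
        mean = [m + lr * g / samples for m, g in zip(mean, grad)]
        reward_at_mean = confidence_reward(frozen_model(mean, weights))
        if reward_at_mean > best_reward:
            best, best_reward = list(mean), reward_at_mean
    return best, best_reward
```

Calling `optimize_thought([0.2, 0.0, -0.1], TOY_WEIGHTS)` returns a thought vector whose reward is at least that of the starting vector, because the sketch keeps the best mean it has seen. Only the thought vector changes; `TOY_WEIGHTS` is never updated, which mirrors the abstract's statement that no model parameter updates are required. The exact reward and policy used by LTPO are not specified in the abstract, so those parts of the sketch are assumptions.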

