Fugu-MT Paper Translation (Abstract): CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Paper overview: CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

arxiv url: http://arxiv.org/abs/2605.20075v1
Date: Tue, 19 May 2026 16:28:53 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.524492
Title: CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
Title (reference translation): CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
Authors: Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee
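Abstract summary: CopT is a reformulated reasoning pipeline that reverses the usual order of thinking and answering. CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer. CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy.
Reference score (independently computed attention score): 22.944748148277146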
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
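Abstract (reference translation): Chain-of-thought (CoT) is a standard approach for drawing out the reasoning capabilities of large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model can identify an answer before extended thinking, a behavior known as performative reasoning. This paper introduces CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking, conditioned on its own draft answer, for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and under continuous-embedding inputs, which yields a sequence-level reverse KL estimator of answer reliability. The analysis shows that, under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, which explains why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is judged insufficiently reliable, CopT performs further on-policy thinking, in which a second KL estimator dynamically controls the visibility of the draft answer, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.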
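
The abstract does not spell out the estimator, so the following is only a minimal sketch of the two ideas it names, not the authors' implementation. It assumes (1) that per-token log-probabilities of the same generated tokens are already available under two input modes, and (2) that a continuous-embedding input behaves like a mixture over the values of an unresolved latent state, while a discrete-token input commits to one value. All function names and toy numbers are hypothetical. Under assumption (2), the expected KL divergence from each committed distribution to the mixture equals the mutual information between the latent state and the answer token, which is a standard identity.

```python
import math


def sequence_log_ratio(logp_sampling, logp_other):
    """Average per-token log-ratio over one generated sequence.

    If the tokens were sampled from the distribution behind `logp_sampling`,
    this is a Monte Carlo estimate of KL(sampling || other) per token.
    Which input mode plays which role in CopT is an assumption here.
    """
    if len(logp_sampling) != len(logp_other) or not logp_sampling:
        raise ValueError("need two equal-length, non-empty log-prob lists")
    diffs = [a - b for a, b in zip(logp_sampling, logp_other)]
    return sum(diffs) / len(diffs)


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def marginal(prior, cond):
    """Mixture p(y) = sum_z prior[z] * cond[z][y]."""
    return [
        sum(pz * row[y] for pz, row in zip(prior, cond))
        for y in range(len(cond[0]))
    ]


def expected_kl_to_mixture(prior, cond):
    """E_z[ KL(p(y|z) || p(y)) ], the expected 'committed vs. mixture' gap."""
    p_y = marginal(prior, cond)
    return sum(pz * kl(row, p_y) for pz, row in zip(prior, cond))


def mutual_information(prior, cond):
    """I(Z;Y) computed directly from the joint p(z, y) = prior[z] * cond[z][y]."""
    p_y = marginal(prior, cond)
    total = 0.0
    for pz, row in zip(prior, cond):
        for y, p_y_given_z in enumerate(row):
            joint = pz * p_y_given_z
            if joint > 0:
                total += joint * math.log(joint / (pz * p_y[y]))
    return total


# Toy latent state with three values and a binary answer token (made-up numbers).
prior = [0.5, 0.3, 0.2]
cond = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
gap = expected_kl_to_mixture(prior, cond)
mi = mutual_information(prior, cond)

# A latent state that does not affect the answer gives a gap of (numerically) zero.
uninformative = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]
flat_gap = expected_kl_to_mixture(prior, uninformative)
```

In this toy setting `gap` and `mi` agree up to floating-point error, and `flat_gap` is close to zero: latent uncertainty that does not change the answer contributes nothing, which matches the abstract's point about answer-relevant rather than arbitrary uncertainty.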

Related papers