Fugu-MT paper translation (abstract): Soft Tokens, Hard Truths

Paper abstract: Soft Tokens, Hard Truths

arxiv url: http://arxiv.org/abs/2509.19170v2
Date: Wed, 24 Sep 2025 11:28:42 GMT
Status: translation complete
Last updated in system: 2025-09-25 14:09:11.259933
Title: Soft Tokens, Hard Truths
Title (reference translation): Soft Tokens, Hard Truths
Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
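Abstract summary: This work introduces a scalable method for learning continuous CoTs via reinforcement learning (RL). It uses "soft" tokens: mixtures of tokens together with noise on the input embedding, to provide RL exploration. On math reasoning benchmarks with Llama and Qwen models, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32.
Reference score (independently computed attention score): 17.640897774014707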
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show that continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
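The abstract describes a "soft" token as a mixture of tokens plus noise on the input embedding. The sketch below is one plausible reading of that sentence, not the authors' implementation: it assumes the mixture is a softmax-weighted average of rows of an embedding table and that the noise is isotropic Gaussian. The function names, the `noise_std` and `temperature` parameters, and the toy embedding values are all hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embedding_table, noise_std=0.0, temperature=1.0, rng=None):
    """Assumed form of a soft token: probability-weighted mixture of
    token embeddings, plus Gaussian noise on the resulting vector."""
    if rng is None:
        rng = random.Random()
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    # Mixture step: expected embedding under the next-token distribution.
    mixed = [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
    # Exploration step: perturb the input embedding with Gaussian noise.
    return [x + rng.gauss(0.0, noise_std) for x in mixed]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Tokens 0 and 1 are equally likely; token 2 has negligible probability.
logits = [2.0, 2.0, -50.0]

noiseless = soft_token(logits, embedding_table, noise_std=0.0)
noisy = soft_token(logits, embedding_table, noise_std=0.1, rng=random.Random(0))
```

With `noise_std=0.0` the result sits roughly halfway between the embeddings of tokens 0 and 1, which is the "superposition" intuition mentioned in the abstract. Feeding such a vector back as the next input embedding, instead of the embedding of a single sampled token, is what would make the CoT continuous under this reading.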
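Abstract (reference translation): The use of continuous instead of discrete tokens in the Chain-of-Thought (CoT) phase of LLMs has recently attracted attention, based on the intuition that a continuous mixture of discrete tokens can simulate a superposition of several reasoning paths at once. Theoretical results formally prove that continuous tokens are far more expressive and can solve specific problems more efficiently. Previous works either use continuous tokens only at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work to introduce a scalable method for learning continuous CoTs via reinforcement learning (RL) without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding, to provide RL exploration. Computational overhead is minimal, making it possible to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens for inference. Finally, we show that continuous CoT RL training better preserves the base model's predictions on out-of-domain tasks, providing a softer touch to the base model.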
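The results above are reported as pass@1 and pass@32. The abstract does not say how these are computed, so the following is only a sketch of the unbiased pass@k estimator that is widely used for sampled generations: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k samples is correct. The function name and the example counts are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability
    that k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 32 samples for one problem, 4 of them correct.
p1 = pass_at_k(32, 4, 1)    # fraction of correct samples
p32 = pass_at_k(32, 4, 32)  # at least one correct among all 32
```

Under this estimator, pass@1 tracks average sample accuracy, while pass@32 rewards having any correct sample among many, which is why a higher pass@32 at equal pass@1 is read as greater CoT diversity.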

Related papers