Fugu-MT Paper Translation (Abstract): OLLM: Options-based Large Language Models

Paper overview: OLLM: Options-based Large Language Models

arxiv url: http://arxiv.org/abs/2604.19087v1
Date: Tue, 21 Apr 2026 04:59:37 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.627146
Title: OLLM: Options-based Large Language Models
Title (reference translation): OLLM: Options-based Large Language Models
Authors: Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
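Abstract summary: Options LLM (OLLM) is a simple, general method that replaces the single next-token prediction of standard LLMs. A small latent space parametrizes multiple plausible next-token options that a downstream policy can select or search. The results show that optionized next-token modeling improves controllability, robustness, and efficiency in math reasoning.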
Reference score (independently computed attention score): 1.4783646973333087
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a \textit{set of learned options} for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers: an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only $1.56\%$ of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at $51\%$ final answer correctness, while OLLM's option set allows up to $\sim 70\%$ under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
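Abstract (reference translation): This paper introduces Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a set of learned options for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options that a downstream policy can select or search. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers, an encoder and a decoder, before the output head. OLLM is applied to a 1.7B-parameter backbone (only 1.56% of parameters trainable), trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM's option set allows up to about 70% under optimal latent selection. A compact policy is then trained in the latent space; it emits latents to control generation. Operating in a low-dimensional option space makes reward optimization more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), because the policy is constrained to options learned during SFT. Importantly, this alignment arises from model structure rather than from additional KL or handcrafted alignment losses. The results show that optionized next-token modeling improves controllability, robustness, and efficiency in math reasoning, and point to latent-space policy learning as a promising direction for reinforcement learning in LLMs.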
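The abstract does not spell out the layer shapes or the training objective, so the following is only a toy NumPy sketch of the general idea: a discrete latent `z` indexes several next-token distributions computed from one hidden state, and an "optimal latent selection" oracle picks the option that best explains a known target token. Every name and size here (`option_emb`, `W_dec`, `NUM_OPTIONS`, the residual `tanh` decoder) is an assumption made for illustration, not the authors' architecture, and the encoder that the paper pairs with the decoder is omitted because the abstract does not describe its role.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, all assumed for illustration only.
HIDDEN, LATENT, VOCAB, NUM_OPTIONS = 16, 4, 50, 8

# Stand-in for the frozen output head of a pretrained backbone.
W_head = rng.normal(size=(HIDDEN, VOCAB)) / np.sqrt(HIDDEN)

# Hypothetical plug-in parameters: one embedding per discrete latent z,
# and a decoder layer that mixes the hidden state with that embedding.
option_emb = rng.normal(size=(NUM_OPTIONS, LATENT))
W_dec = rng.normal(size=(HIDDEN + LATENT, HIDDEN)) / np.sqrt(HIDDEN + LATENT)


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def option_log_probs(h):
    """Return next-token log-probs of shape [NUM_OPTIONS, VOCAB], one row per z."""
    h_rep = np.broadcast_to(h, (NUM_OPTIONS, HIDDEN))
    dec_in = np.concatenate([h_rep, option_emb], axis=-1)
    h_z = h + np.tanh(dec_in @ W_dec)  # residual update, an assumed design
    return log_softmax(h_z @ W_head)


def oracle_latent(h, target_token):
    """Pick the z whose option gives the target token the highest probability."""
    lp = option_log_probs(h)
    return int(np.argmax(lp[:, target_token]))


h = rng.normal(size=HIDDEN)   # pretend hidden state from the backbone
lp = option_log_probs(h)
z_star = oracle_latent(h, target_token=3)
```

Because the oracle takes a maximum over options, its score for the target token can never be lower than that of any single fixed option; this is the sense in which selection over an option set gives an upper bound that a learned latent policy then tries to approach.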

Related papers