Fugu-MT paper translation (abstract): MEMENTO: Teaching LLMs to Manage Their Own Context

Paper overview: MEMENTO: Teaching LLMs to Manage Their Own Context

arxiv url: http://arxiv.org/abs/2604.09852v1
Date: Fri, 10 Apr 2026 19:30:29 GMT
Status: translation complete
Last updated in system: 2026-04-14 20:13:15.709498
Title: MEMENTO: Teaching LLMs to Manage Their Own Context
Authors: Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos
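Abstract summary: Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, and reason forward by attending only to mementos. We show that a two-stage SFT recipe on OpenMementos is effective across different model families and scales.
Reference score (independently computed attention metric): 50.3558738319336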
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B to 32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ${\sim}2.5\times$ peak KV cache reduction. We extend vLLM to support our inference method, achieving ${\sim}1.75\times$ throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15 pp on AIME24.
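The block-then-memento idea in the abstract can be illustrated with a toy token-count model. The sketch below is not the authors' implementation or their vLLM extension. It only tracks how many tokens sit in context (a rough proxy for KV cache size) when each raw reasoning block is dropped once its memento has been written. The names `Block` and `peak_context` and all token counts are hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    reasoning_tokens: int  # length of the raw reasoning block (made-up value)
    memento_tokens: int    # length of the summary written after the block


def peak_context(prompt_tokens: int, blocks: List[Block], evict: bool) -> int:
    """Return the peak number of tokens held in context at any point."""
    kept = prompt_tokens  # tokens that remain in context going forward
    peak = kept
    for b in blocks:
        # While a block and its memento are being generated, both are in context.
        live = kept + b.reasoning_tokens + b.memento_tokens
        peak = max(peak, live)
        if evict:
            # Memento-style: drop the raw block, keep only its memento.
            kept += b.memento_tokens
        else:
            # Baseline: everything generated so far stays in context.
            kept = live
    return peak


# Illustrative numbers only; they are not taken from the paper.
blocks = [Block(reasoning_tokens=1000, memento_tokens=100) for _ in range(3)]
full = peak_context(100, blocks, evict=False)
compact = peak_context(100, blocks, evict=True)
print(full, compact)  # prints: 3400 1400
```

This model counts tokens only. It does not capture the dual information stream described in the abstract, where the KV states of a memento retain implicit information from the block they were computed over.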

