Fugu-MT Paper Translation (Abstract): Neural Garbage Collection: Learning to Forget while Learning to Reason

Paper summary: Neural Garbage Collection: Learning to Forget while Learning to Reason

arxiv url: http://arxiv.org/abs/2604.18002v1
Date: Mon, 20 Apr 2026 09:26:28 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.788363
Title: Neural Garbage Collection: Learning to Forget while Learning to Reason
Authors: Michael Y. Li, Jubayer Ibn Hamid, Emily B. Fox, Noah D. Goodman
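Abstract summary: In Neural Garbage Collection (NGC), a language model learns to forget while learning to reason. By treating chain-of-thought tokens and cache-eviction decisions as discrete actions sampled from the language model, reinforcement learning can jointly optimize how the model reasons and how it manages its own memory. On Countdown, AMC, and AIME tasks, NGC maintains strong accuracy relative to the full-cache upper bound at 2-3x peak KV cache size compression.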
Reference score (independently computed attention score): 36.674101487378245
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chain-of-thought reasoning has driven striking advances in language model capability, yet every reasoning step grows the KV cache, creating a bottleneck to scaling this paradigm further. Current approaches manage these constraints on the model's behalf using hand-designed criteria. A more scalable approach would let end-to-end learning subsume this design choice entirely, following a broader pattern in deep learning. After all, if a model can learn to reason, why can't it learn to forget? We introduce Neural Garbage Collection (NGC), in which a language model learns to forget while learning to reason, trained end-to-end from outcome-based task reward alone. As the model reasons, it periodically pauses, decides which KV cache entries to evict, and continues to reason conditioned on the remaining cache. By treating tokens in a chain-of-thought and cache-eviction decisions as discrete actions sampled from the language model, we can use reinforcement learning to jointly optimize how the model reasons and how it manages its own memory: what the model evicts shapes what it remembers, what it remembers shapes its reasoning, and the correctness of that reasoning determines its reward. Crucially, the model learns this behavior entirely from a single learning signal - the outcome-based task reward - without supervised fine-tuning or proxy objectives. On Countdown, AMC, and AIME tasks, NGC maintains strong accuracy relative to the full-cache upper bound at 2-3x peak KV cache size compression and substantially outperforms eviction baselines. Our results are a first step towards a broader vision where end-to-end optimization drives both capability and efficiency in language models.
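Illustrative sketch (not the authors' implementation): the abstract describes a loop in which the model reasons, periodically pauses to choose which KV cache entries to evict, and is trained from the outcome reward alone. The toy Python below mimics only that control flow. Everything in it is an assumption made for illustration: the names `sample_eviction`, `rollout`, `reinforce_surrogate`, and `score_fn`, the independent keep/evict Bernoulli decisions, the fixed pause interval, and the plain REINFORCE-style surrogate, which stands in for whatever reinforcement learning algorithm the paper actually uses. Token sampling is omitted; in the described method the token log-probabilities would enter the same objective.

```python
import math
import random


def sample_eviction(scores, rng):
    """Sample an independent keep/evict decision for each cache entry.

    scores[i] is a hypothetical logit for keeping entry i (assumption).
    Returns the kept indices and the log-probability of the sampled decisions.
    """
    keep, log_prob = [], 0.0
    for i, s in enumerate(scores):
        p_keep = 1.0 / (1.0 + math.exp(-s))
        if rng.random() < p_keep:
            keep.append(i)
            log_prob += math.log(p_keep)
        else:
            log_prob += math.log(1.0 - p_keep)
    return keep, log_prob


def rollout(num_steps, interval, score_fn, rng):
    """Toy rollout: the cache grows every step and is pruned every `interval` steps."""
    cache = []            # stand-in for KV cache entries (here: step indices)
    total_log_prob = 0.0  # log-probability of all eviction decisions taken
    peak = 0              # peak cache size seen during the rollout
    for t in range(num_steps):
        cache.append(t)                  # each reasoning step adds an entry
        peak = max(peak, len(cache))
        if (t + 1) % interval == 0:      # periodic pause to decide evictions
            scores = [score_fn(entry, t) for entry in cache]
            keep, lp = sample_eviction(scores, rng)
            cache = [cache[i] for i in keep]
            total_log_prob += lp
    return cache, total_log_prob, peak


def reinforce_surrogate(reward, log_prob, baseline=0.0):
    """Generic REINFORCE-style surrogate loss; an assumed stand-in estimator."""
    return -(reward - baseline) * log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical scorer: every entry is kept with probability 0.5.
    cache, log_prob, peak = rollout(20, 5, lambda entry, t: 0.0, rng)
    reward = 1.0  # placeholder outcome reward (e.g. final answer judged correct)
    print(len(cache), peak, reinforce_surrogate(reward, log_prob))
```

In this sketch the only learning signal is the scalar `reward` multiplied by the log-probability of the sampled eviction decisions, which mirrors the abstract's point that no supervised fine-tuning or proxy objective is involved. The ratio of the full cache length to `peak` corresponds to the peak KV cache size compression the abstract reports.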
