Fugu-MT paper translation (abstract): End-to-End Test-Time Training for Long Context

Paper overview: End-to-End Test-Time Training for Long Context

arxiv url: http://arxiv.org/abs/2512.23675v1
Date: Mon, 29 Dec 2025 18:30:14 GMT
Status: translation complete
Last updated in system: 2025-12-30 22:37:30.614849
Title: End-to-End Test-Time Training for Long Context
Title (reference translation): End-to-End Test-Time Training for Long Context
Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
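Abstract summary: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning).
Reference score (attention score computed independently by this site): 98.3930777591529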
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
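Abstract (reference translation): We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we use only a standard architecture: a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and at training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as a Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.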

Related papers