Fugu-MT Paper Translation (Abstract): Revealing the Learning Dynamics of Long-Context Continual Pre-training

Paper abstract: Revealing the Learning Dynamics of Long-Context Continual Pre-training

arxiv url: http://arxiv.org/abs/2604.02650v1
Date: Fri, 03 Apr 2026 02:26:28 GMT
Status: Translation complete
Last updated in system: 2026-04-06 17:20:24.283479
Title: Revealing the Learning Dynamics of Long-Context Continual Pre-training
Title (reference translation): Revealing the Learning Dynamics of Long-Context Continual Pre-training
Authors: Yupu Liang, Shuang Chen, Guanwei Zhang, Shaolei Wang, Suncong Zheng,
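Abstract summary: We present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters). Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels.
Reference score (attention metric computed independently by the site): 5.904978452102138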
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of dozens of billions of tokens are insufficient for industrial-grade LLMs' LCCP (e.g., Hunyuan-A13B reaches saturation after training over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report "fake saturation" early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLMs.
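Abstract (reference translation): Existing research on Long-Context Continual Pre-training (LCCP) focuses mainly on small-scale models and limited data regimes (tens of billions of tokens). We argue that transferring these small-scale settings directly to industrial-grade models risks insufficient adaptation and premature termination of training. Moreover, current evaluation methods depend heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to "deceptive saturation". In this paper, we conduct the first systematic study of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution over a 200B-token training trajectory. Specifically, we propose a hierarchical framework that analyzes LCCP dynamics at the behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings are as follows. (1) Necessity of massive data scaling: training regimes of tens of billions of tokens are insufficient for the LCCP of industrial-grade LLMs (e.g., Hunyuan-A13B saturates only after training on more than 150B tokens). (2) Deceptive saturation versus intrinsic saturation: conventional NIAH scores report "fake saturation" early, whereas PPL-based analysis reveals continued intrinsic improvement and correlates more strongly with downstream performance. (3) Mechanistic monitoring for training stability: retrieval heads serve as efficient, low-resource training monitors, because their evolving attention scores reliably track LCCP progress and correlate highly with SFT results. This work provides a comprehensive monitoring framework, an evaluation system, and a mechanistic interpretation for the LCCP of industrial-grade LLMs.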
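
As a rough illustration of the probabilistic level described in the abstract, the sketch below computes perplexity from per-token log-probabilities and a Pearson correlation between a per-checkpoint monitoring metric and downstream scores. The function names and the choice of Pearson correlation are assumptions made here for illustration; the abstract does not specify how the authors compute either quantity.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length series.

    Hypothetical use: xs holds a monitoring metric per checkpoint
    (e.g. PPL), ys holds the matching downstream scores.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)
```

In such a setup, a strongly negative correlation between checkpoint PPL and downstream scores would suggest that falling perplexity tracks improving downstream performance, in the spirit of the PPL-based monitoring the abstract describes.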
