Fugu-MT paper translation (abstract): Layer Collapse in Diffusion Language Models

Paper overview: Layer Collapse in Diffusion Language Models

arxiv url: http://arxiv.org/abs/2605.06366v2
Date: Mon, 11 May 2026 08:56:53 GMT
Status: translation complete
Last updated in system: 2026-05-12 16:21:29.399567
Title: Layer Collapse in Diffusion Language Models
Title (reference translation): Layer Collapse in Diffusion Language Models
Authors: Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu,
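Abstract summary: Diffusion language models (DLMs) have emerged as an alternative to autoregressive (AR) language models. We show that layer collapse in DLMs is driven by overtraining rather than undertraining. Our findings have strong practical implications.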
Reference score (independently computed attention score): 54.880703002010144
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers -- the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only -1.8% on GSM8K, whereas Llama-3.1-8B drops -64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama -8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.
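Abstract (reference translation): Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, but how their activation dynamics differ is not well understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers show very similar, collapsed activation patterns dominated by one large super-outlier that persists over a long token range. Although it looks redundant, this outlier is critical: pruning it degrades outputs into repetitive loops of random tokens. Paradoxically, the layers of LLaDA contain more redundant representations overall, and the redundancy is most pronounced in the earlier layers. This is the reverse of AR models, where deeper layers become redundant because of undertraining. Our analysis indicates that layer collapse in DLMs is caused by overtraining rather than undertraining: a dominant outlier becomes an indispensable information carrier while the remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: under 3-bit GPTQ quantization, LLaDA drops only -1.8% on GSM8K, whereas Llama-3.1-8B drops -64.7%. The optimal sparsity allocation also reverses between the two families: at 50% average sparsity, allocating more sparsity to the early layers of LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama -8.4%. These results show that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.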
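
The abstract compares two ways of distributing a fixed 50% average sparsity across layers: more pruning in the early layers, or the reverse. The sketch below illustrates what such an allocation could look like. It is not the authors' method: the linear schedule, the `delta` half-range, the function names, and the use of plain magnitude pruning are all assumptions made for illustration, and the abstract does not say which schedule or pruning criterion the paper uses.

```python
def linear_sparsity_schedule(num_layers, mean_sparsity=0.5, delta=0.2, early_heavy=True):
    """Per-layer sparsity ratios that change linearly with depth.

    The ratios span mean_sparsity +/- delta and average to mean_sparsity.
    `delta` and the linear shape are illustrative assumptions, not taken
    from the paper.
    """
    if num_layers < 2:
        raise ValueError("need at least two layers")
    if mean_sparsity - delta < 0.0 or mean_sparsity + delta > 1.0:
        raise ValueError("sparsity ratios must stay within [0, 1]")
    step = 2.0 * delta / (num_layers - 1)
    # Start at the highest ratio and decrease layer by layer.
    ratios = [mean_sparsity + delta - i * step for i in range(num_layers)]
    # early_heavy=True prunes early layers most; False reverses the order.
    return ratios if early_heavy else ratios[::-1]


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest magnitude.

    `weights` is a flat list of floats standing in for one layer's weights.
    Magnitude pruning is a stand-in criterion chosen for simplicity.
    """
    k = round(sparsity * len(weights))
    # Indices of the k smallest-magnitude entries.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


if __name__ == "__main__":
    early = linear_sparsity_schedule(4, mean_sparsity=0.5, delta=0.2, early_heavy=True)
    late = linear_sparsity_schedule(4, mean_sparsity=0.5, delta=0.2, early_heavy=False)
    print("early-heavy:", [round(r, 3) for r in early])
    print("late-heavy: ", [round(r, 3) for r in late])
    print(magnitude_prune([0.9, -0.1, 0.4, -0.7, 0.05, 0.3], 0.5))
```

Both schedules keep the same average sparsity, so any accuracy difference between them would come from where the pruning is applied rather than how much of it there is. That is the kind of comparison the abstract reports for LLaDA and Llama.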

Related papers