Fugu-MT Paper Translation (Abstract): Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

Paper overview: Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

arxiv url: http://arxiv.org/abs/2604.10074v1
Date: Sat, 11 Apr 2026 07:46:15 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:15.825165
Title: Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
Authors: Hongkang Li, Hancheng Min, Rene Vidal
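Abstract summary: This paper provides the first convergence analysis for training transformer-based diffusion models. It considers the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism, which lets the trained model approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps.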
Reference score (independently computed attention metric): 13.741630476895773
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. However, we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.
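Illustrative sketch (not from the paper): the abstract refers to the oracle MMSE estimator of the injected noise. As a rough aid, the code below computes that estimator in closed form for a deliberately simplified case, a scalar, single-token Gaussian mixture, assuming the standard DDPM forward step x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The function name `gmm_mmse_noise` and all parameter names are hypothetical, and the paper's multi-token setting and transformer architecture are not modeled here.

```python
import math

def gmm_mmse_noise(x_t, alpha_bar, weights, means, variances):
    """Return E[eps | x_t] for a scalar Gaussian mixture prior on x_0.

    Assumes x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps,
    with eps ~ N(0, 1) independent of x_0 (an assumption of this sketch).
    """
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    log_post = []    # unnormalized log posterior of each component given x_t
    cond_means = []  # E[eps | x_t, component k]
    for w, mu, var in zip(weights, means, variances):
        m_t = a * mu                               # mean of x_t under component k
        v_t = alpha_bar * var + (1.0 - alpha_bar)  # variance of x_t under component k
        log_post.append(
            math.log(w)
            - 0.5 * math.log(2.0 * math.pi * v_t)
            - 0.5 * (x_t - m_t) ** 2 / v_t
        )
        # eps and x_t are jointly Gaussian within a component,
        # with Cov(eps, x_t) = b and Var(x_t) = v_t.
        cond_means.append(b * (x_t - m_t) / v_t)
    # Normalize the component posteriors with a max-shift for stability.
    shift = max(log_post)
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    return sum(u / total * c for u, c in zip(unnorm, cond_means))
```

For a single standard-normal component the estimate reduces to sqrt(1 - alpha_bar) * x_t, which gives a quick sanity check of the formula.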

Related papers