Fugu-MT paper translation (abstract): SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

Paper overview: SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization

arxiv url: http://arxiv.org/abs/2510.04961v1
Date: Mon, 06 Oct 2025 15:57:31 GMT
Status: translation complete
Last updated in system: 2025-10-07 16:52:59.966674
Title: SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
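Title (reference translation): SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
Authors: Théophane Vallaeys, Jakob Verbeek, Matthieu Cord
Abstract summary: We introduce a new pixel diffusion decoder architecture for improved scaling and training stability. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction that is trained without adversarial losses.
Reference score (attention score computed independently by this site): 56.12853087022071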
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
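Abstract (reference translation): Tokenizers are a key component of state-of-the-art generative image models: they extract the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE) and are trained with reconstruction, perceptual, and adversarial losses. Diffusion decoders have been proposed as a more principled alternative for modeling the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, and iterative sampling makes decoding slower. To address these limitations, the authors introduce a new pixel diffusion decoder architecture that improves scaling and training stability and benefits from transformer components and GAN-free training. Distillation is then used to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction that is trained without adversarial losses, and it reaches higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput, and it preserves the generation quality of DiTs with $3.8\times$ faster sampling. SSDD can therefore serve as a drop-in replacement for KL-VAE and as a basis for higher-quality, faster generative models.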
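
To put the figures quoted in the abstract in relative terms, the short Python sketch below converts them into percentages. The helper names are illustrative and not from the paper, and it assumes that "higher throughput" and "faster sampling" can be read as simple ratios of time per image; the abstract does not state how they were measured.

```python
# Numbers taken from the abstract; helper names are hypothetical.

def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease of a lower-is-better metric such as FID."""
    return (before - after) / before


def time_fraction(speedup: float) -> float:
    """Fraction of the baseline time that remains after a given speedup.

    Assumes the speedup is a plain ratio of time per image.
    """
    return 1.0 / speedup


fid_drop = relative_reduction(0.87, 0.50)  # reconstruction FID, 0.87 -> 0.50
decode_time = time_fraction(1.4)           # 1.4x higher decoding throughput
sampling_time = time_fraction(3.8)         # 3.8x faster sampling with DiTs

print(f"Reconstruction FID reduction: {fid_drop:.1%}")
print(f"Decoding time vs. baseline:   {decode_time:.1%}")
print(f"Sampling time vs. baseline:   {sampling_time:.1%}")
```

Under these assumptions, the reported change corresponds to roughly a 43% lower reconstruction FID, with decoding taking about 71% of the baseline time and sampling about 26% of it.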
