Fugu-MT Paper Translation (Abstract): PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Paper abstract: PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

arxiv url: http://arxiv.org/abs/2605.23902v1
Date: Fri, 22 May 2026 17:59:42 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.465483
Title: PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
Title (reference translation): PiD: Fast and High-Resolution Latent Decoding via Pixel Diffusion
Authors: Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren
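Abstract summary: The latent-to-pixel decoder is reconstruction-oriented: it is optimized to invert the encoder rather than to synthesize detail. The paper introduces PiD, a pixel diffusion decoder that reformulates latent decoding as conditional pixel diffusion. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency.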
Reference score (independently computed attention score): 65.47126282928896
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of $512 \times 512$ images into $2048 \times 2048$ pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about $6\times$ faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
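Abstract (reference translation): Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, generate in a compact latent space, and a decoder maps the generated latents to pixels. However, the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than to synthesize more detail, and it becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Inspired by recent progress in scalable pixel-space diffusion, we introduce PiD, a pixel diffusion decoder that reformulates latent decoding as conditional pixel diffusion and unifies decoding and upsampling in a single generative module. By denoising directly in high-resolution pixel space, PiD synthesizes $4\times$ and even $8\times$ upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, which lets PiD decode partially denoised latents and terminate the latent diffusion process early. To improve efficiency further, the model is distilled with DMD2, reducing inference to 4 steps. PiD applies to both conventional VAE latents and the semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of $512 \times 512$ images into $2048 \times 2048$ pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and in as little as 210 ms on a GB200 GPU, about $6\times$ faster than cascaded diffusion-based super-resolution pipelines.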
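
To make the resolution figures in the abstract concrete, the short sketch below works out the output size and pixel-count growth for the stated $4\times$ and $8\times$ upscaling factors. It is plain arithmetic only and does not reproduce any part of PiD; the function names `upscaled_shape` and `pixel_ratio` are hypothetical, and the $8\times$ case applied to a $512 \times 512$ input is an assumption for illustration (the abstract only states the $512 \times 512$ to $2048 \times 2048$ case explicitly).

```python
def upscaled_shape(height: int, width: int, scale: int) -> tuple[int, int]:
    """Return (height, width) after multiplying each side by `scale`."""
    return height * scale, width * scale


def pixel_ratio(scale: int) -> int:
    """Return how many times more pixels the upscaled image has.

    Scaling both sides by `scale` multiplies the pixel count by scale**2.
    """
    return scale * scale


if __name__ == "__main__":
    # 4x is the case stated in the abstract; 8x on the same input is illustrative.
    for scale in (4, 8):
        h, w = upscaled_shape(512, 512, scale)
        print(f"{scale}x upscaling: {h} x {w} pixels, "
              f"{pixel_ratio(scale)}x the pixel count of the 512 x 512 input")
```

The quadratic growth in pixel count is one reason decoding cost rises quickly at megapixel scale, which is the cost the abstract says PiD is designed to address.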
