Fugu-MT Paper Translation (Summary): D-AR: Diffusion via Autoregressive Models

Paper summary: D-AR: Diffusion via Autoregressive Models

arxiv url: http://arxiv.org/abs/2505.23660v1
Date: Thu, 29 May 2025 17:09:25 GMT
Status: Translation complete
Last updated in system: 2025-05-30 18:14:08.019358
Title: D-AR: Diffusion via Autoregressive Models
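Title (reference translation): D-AR: Diffusion via Autoregressive Models
Authors: Ziteng Gao, Mike Zheng Shou
Abstract summary: Diffusion via Autoregressive models (D-AR) is a new paradigm that recasts the image diffusion process as a vanilla autoregressive procedure. The method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens.
Reference score (independently computed attention score): 21.03363985989625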
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in the streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
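Abstract (reference translation): This paper proposes Diffusion via Autoregressive models (D-AR), a new paradigm that recasts the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We first design a tokenizer that converts images into sequences of discrete tokens, where tokens at different positions are decoded into different diffusion denoising steps in pixel space. Thanks to the properties of diffusion, these tokens naturally follow a coarse-to-fine order, which lends itself directly to autoregressive modeling. We therefore apply standard next-token prediction to these tokens without modifying any underlying design (causal masks or training/inference strategies), and this sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, those tokens can be decoded directly into the corresponding diffusion denoising step in a streaming manner. The pipeline also exhibits several notable properties: for example, it supports consistent previews when only a subset of tokens is generated, and it enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, the method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope this work inspires future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR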
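
The streaming behaviour described in the abstract (generate an increment of tokens, then decode it into the corresponding denoising step) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_next_token`, `toy_denoise_step`, `stream_generate`, the even split of tokens across steps, and the values of `NUM_STEPS` and `VOCAB_SIZE` are all hypothetical stand-ins chosen for illustration. Only the sequence length of 256 tokens comes from the abstract.

```python
import random
from typing import Iterator, List

NUM_TOKENS = 256   # sequence length reported in the abstract
NUM_STEPS = 8      # ASSUMPTION: number of denoising steps, not given in the source
VOCAB_SIZE = 1024  # ASSUMPTION: codebook size, not given in the source
TOKENS_PER_STEP = NUM_TOKENS // NUM_STEPS  # ASSUMPTION: tokens split evenly across steps


def toy_next_token(prefix: List[int], rng: random.Random) -> int:
    """Hypothetical stand-in for an autoregressive model's next-token sampler."""
    previous = prefix[-1] if prefix else 0
    # A real model would compute logits from the whole prefix; this toy version
    # only perturbs the previous token so the output depends on the prefix.
    return (previous + rng.randrange(VOCAB_SIZE)) % VOCAB_SIZE


def toy_denoise_step(image: List[float], group: List[int]) -> List[float]:
    """Hypothetical stand-in for one token-conditioned denoising step."""
    # Map the token group to a scalar "target" in [0, 1] and blend toward it.
    target = sum(group) / (len(group) * (VOCAB_SIZE - 1))
    return [0.5 * pixel + 0.5 * target for pixel in image]


def stream_generate(seed: int = 0, num_pixels: int = 4) -> Iterator[List[float]]:
    """Yield a preview image after each increment of generated tokens."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # start from noise
    tokens: List[int] = []
    for step in range(NUM_STEPS):
        # Plain next-token prediction: extend the sequence one token at a time.
        for _ in range(TOKENS_PER_STEP):
            tokens.append(toy_next_token(tokens, rng))
        # Decode only the newest increment into the matching denoising step.
        group = tokens[step * TOKENS_PER_STEP:(step + 1) * TOKENS_PER_STEP]
        image = toy_denoise_step(image, group)
        yield image  # intermediate preview, available before generation ends


if __name__ == "__main__":
    for index, preview in enumerate(stream_generate(seed=0)):
        print(f"step {index}: {[round(p, 3) for p in preview]}")
```

Because each preview depends only on the tokens generated so far, stopping the generator early gives the same intermediate images as a full run with the same seed. That is a toy analogue of the "consistent previews" property the abstract mentions.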
