Fugu-MT Paper Translation (Abstract): Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

Paper summary: Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

arxiv url: http://arxiv.org/abs/2510.16751v2
Date: Fri, 24 Oct 2025 20:55:17 GMT
Status: Translation complete
Last updated in system: 2025-10-28 13:14:10.573654
Title: Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
Title (reference translation): Visual Autoregressive Models and Diffusion Models under Inference-Time Scaling
Authors: Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
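Abstract summary: We show that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. Beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks.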
Reference score (independently computed attention score): 3.558452956820138
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
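The search idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_sample_next` and `toy_verifier` are hypothetical stand-ins for an autoregressive image model and a verifier, and the token vocabulary is a toy one. The sketch only shows why a discrete, sequential token space lends itself to beam search: partial sequences can be scored and pruned early, and surviving prefixes are extended rather than regenerated.

```python
import random
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]


def toy_sample_next(prefix: Tokens, n: int, vocab_size: int,
                    rng: random.Random) -> List[int]:
    """Hypothetical stand-in for drawing n candidate next tokens from a model.

    A real autoregressive model would condition on `prefix`; this toy ignores it.
    """
    return [rng.randrange(vocab_size) for _ in range(n)]


def toy_verifier(tokens: Tokens) -> float:
    """Hypothetical verifier: counts even tokens.

    A real verifier would score how well the (partial) image matches the prompt.
    """
    return float(sum(1 for t in tokens if t % 2 == 0))


def beam_search(steps: int, beam_width: int, expansions: int, vocab_size: int,
                verifier: Callable[[Tokens], float], seed: int = 0) -> Tokens:
    """Verifier-guided beam search over discrete token sequences."""
    rng = random.Random(seed)
    beams: List[Tokens] = [()]
    for _ in range(steps):
        candidates: List[Tokens] = []
        for prefix in beams:
            # Each surviving prefix is reused and extended by one token.
            for tok in toy_sample_next(prefix, expansions, vocab_size, rng):
                candidates.append(prefix + (tok,))
        # Early pruning: only the best-scoring partial sequences continue.
        candidates.sort(key=verifier, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


if __name__ == "__main__":
    best = beam_search(steps=8, beam_width=2, expansions=4, vocab_size=16,
                       verifier=toy_verifier, seed=0)
    print(best, toy_verifier(best))
```

In this toy setting the verifier is called on every partial candidate, so the verifier's cost per call directly bounds how wide and deep the search can go, which echoes the speed versus capability trade-off the abstract mentions.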
