Fugu-MT Paper Translation (Abstract): Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

Paper summary: Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

arxiv url: http://arxiv.org/abs/2603.14186v1
Date: Sun, 15 Mar 2026 02:22:27 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.656753
Title: Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models
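Title (reference translation): Fair Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models
Authors: Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang
Abstract summary: State-of-the-art text-to-image models produce high-quality images, but inference is expensive. One-step models aim to reduce this cost by mapping noise to an image in a single step. The paper shows that FID-focused model development and CFG selection can be misleading in few-step regimes.
Reference score (site's own attention estimate): 4.809245505572861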
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
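Abstract (reference translation): State-of-the-art text-to-image models produce high-quality images, but inference is expensive because generation requires several sequential ODE steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, but fair comparison with multi-step systems is difficult. It is also unclear how well one-step models scale to multi-step inference, and standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet is limited. We therefore benchmark eight models, spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1), under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where FID can improve while text-image alignment and human preference signals degrade and perceived quality worsens. We further show that leading one-step models benefit from step scaling and become more competitive under multi-step inference, although characteristic local distortions remain. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.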
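
Note: the abstract names MinMax Harmonic Mean (MMHM) but does not give its formula. The sketch below is only one plausible reading of the name, not the authors' definition: each metric is min-max normalized across the configurations in a sweep, lower-is-better metrics such as FID are flipped, and the normalized values are combined with a harmonic mean. The function name `minmax_harmonic_mean`, the `lower_is_better` parameter, the metric keys, and the `eps` smoothing term are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def minmax_harmonic_mean(
    rows: List[Dict[str, float]],
    lower_is_better: Sequence[str] = ("fid",),
    eps: float = 1e-8,
) -> List[float]:
    """Return one composite score per configuration (hypothetical MMHM sketch).

    rows: one dict per configuration (e.g. per CFG scale or step count),
          mapping a metric name to its raw value. All dicts share the same keys.
    """
    metrics = list(rows[0].keys())
    normalized: List[Dict[str, float]] = [{} for _ in rows]

    for name in metrics:
        values = [row[name] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for i, value in enumerate(values):
            if span == 0:
                # Metric does not vary across the sweep: treat it as neutral.
                score = 1.0
            else:
                score = (value - lo) / span
                if name in lower_is_better:
                    # Flip so that 1.0 is always the best configuration.
                    score = 1.0 - score
            normalized[i][name] = score

    # Harmonic mean over metrics; eps avoids division by zero when a
    # configuration is the worst in the sweep on some metric.
    return [
        len(metrics) / sum(1.0 / (norm[name] + eps) for name in metrics)
        for norm in normalized
    ]
```

Under this reading, a configuration that is worst on any single metric scores close to zero, so a guidance setting that improves FID while degrading CLIP Score or Pick Score would not be selected. Picking the configuration with the highest returned score would then stand in for picking by FID alone.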


Related papers