Fugu-MT Paper Translation (Abstract): Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Paper overview: Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

arxiv url: http://arxiv.org/abs/2604.18168v1
Date: Mon, 20 Apr 2026 12:28:58 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.860185
Title: Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Title (reference translation): One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Authors: Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang
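Abstract summary: We develop a text-conditioned MeanFlow generation process for the first time. Integrating powerful text encoders using conventional training strategies results in unsatisfactory performance. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation.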
Reference score (independently computed attention score): 37.78791777901399
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
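Abstract (reference translation): Few-step generation has been a long-standing goal, and recent one-step generation methods exemplified by MeanFlow have achieved remarkable results. Existing research on MeanFlow focuses mainly on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared with limited class labels, text conditions place greater demands on the model's understanding capability, which makes it necessary to integrate powerful text encoders into the MeanFlow framework effectively. Surprisingly, although incorporating text conditions appears straightforward, integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the cause, we conduct detailed analyses and show that, because MeanFlow generation has an extremely limited number of refinement steps (as few as one), the text feature representations must be sufficiently discriminative. This also explains why discrete, easily distinguishable class features work well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, achieving efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on a widely used diffusion model, demonstrating significant improvements in generation performance. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.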

Related papers