Fugu-MT Paper Translation (Summary): HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Paper summary: HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

arxiv url: http://arxiv.org/abs/2605.11061v1
Date: Mon, 11 May 2026 17:59:09 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.336604
Title: HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
Title (reference translation): HiDream-O1-Image: A natively unified image generation model using a pixel-level unified transformer
Authors: Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Yingwei Pan, Yi Peng, Zhaofan Qiu, Kai Yu, Yiheng Zhang, Hao Ai, Siying Bai, Yang Chen, Zhihui Chen, Fengbin Gao, Ying Guo, Dong Li, Zhen Shen, Leilei Shi, Jing Wang, Siyu Wang, Yimeng Wang, Rui Zheng, Ting Yao, Tao Mei
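Abstract summary: We propose HiDream-O1-Image, a unified generative foundation model built on a pixel-space Diffusion Transformer. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves structural unification of multimodal inputs. Experiments show that HiDream-O1-Image excels across a range of generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization.
Reference score (attention metric computed by this site): 104.09730595701468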
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly more parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version HiDream-O1-Image-Pro (200B+) unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.
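Abstract (reference translation): The evolution of visual generative models has long been constrained by fragmented architectures that depend on disjoint text encoders and external VAEs. This report presents HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer, which pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm removes the need for a separate VAE or a disjoint pre-trained text encoder, so the model can treat diverse generation and editing tasks as one consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across a range of generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) matches or even surpasses established state-of-the-art models with far more parameters (e.g., 27B Qwen-Image). To validate the scalability of the paradigm, the authors also scale the architecture to more than 200B parameters. The experimental results indicate that this large-scale version, HiDream-O1-Image-Pro (200B+), unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art results. Ultimately, HiDream-O1-Image highlights the potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.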
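
The abstract describes mapping raw image pixels and text tokens into a single shared token space. The sketch below is only an illustration of that general idea, not the authors' implementation: the patch size, the random projection weights, and the function names `patchify` and `build_sequence` are all hypothetical, and the paper's actual tokenization is not given on this page. It cuts an image into pixel patches, projects each patch to the model width, and concatenates the result with text embeddings of the same width. Task-specific condition tokens are omitted; under the same assumption they would simply be appended to the same sequence.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values of one pixel
            patches.append(vec)
    return patches

def linear(vec, weight):
    """Project a vector with a (len(vec) x d_model) weight matrix."""
    d_model = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_model)]

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical; a real model would train these)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def build_sequence(text_ids, image, patch, d_model, vocab_size, seed=0):
    """Return one token sequence holding text tokens followed by pixel-patch tokens."""
    rng = random.Random(seed)
    channels = len(image[0][0])
    text_table = random_matrix(vocab_size, d_model, rng)                # text embedding table
    pixel_proj = random_matrix(patch * patch * channels, d_model, rng)  # pixel patch projection
    text_tokens = [text_table[i] for i in text_ids]
    pixel_tokens = [linear(p, pixel_proj) for p in patchify(image, patch)]
    # Both modalities now share the same width, so one transformer could attend over all of them.
    return text_tokens + pixel_tokens

if __name__ == "__main__":
    img = [[[0.5, 0.5, 0.5] for _ in range(4)] for _ in range(4)]  # 4 x 4 RGB image
    seq = build_sequence([1, 5, 2], img, patch=2, d_model=8, vocab_size=10)
    print(len(seq), len(seq[0]))
```

In this toy setup the 4 x 4 image yields four 2 x 2 patches, so the sequence holds three text tokens plus four pixel tokens, each of width 8.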


Related papers