Fugu-MT Paper Translation (Abstract): Seedream 4.0: Toward Next-generation Multimodal Image Generation

Paper overview: Seedream 4.0: Toward Next-generation Multimodal Image Generation

arxiv url: http://arxiv.org/abs/2509.20427v2
Date: Sun, 28 Sep 2025 13:10:54 GMT
Status: Translation complete
Last updated in system: 2025-09-30 14:13:47.614874
Title: Seedream 4.0: Toward Next-generation Multimodal Image Generation
Title (reference translation): Seedream 4.0: Toward Next-generation Multimodal Image Generation
Authors: Team Seedream: Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
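Abstract summary: Seedream 4.0 is an efficient and high-performance multimodal image generation system. It unifies text-to-image (T2I) synthesis, image editing, and multi-image composition in a single framework. Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse knowledge-centric concepts.
Reference score (independently calculated attention level): 88.86697995940511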
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training that jointly trains T2I and image editing tasks. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also supports multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.
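Abstract (reference translation): This paper introduces Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition in a single framework. We develop a highly efficient diffusion transformer with a powerful VAE, considerably reducing the number of image tokens. This makes efficient training of the model possible and allows native high-resolution images (e.g., 1K-4K) to be generated quickly. Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, combined with optimized strategies, ensures stable, large-scale training and strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training that jointly trains the T2I and image editing tasks. For inference acceleration, adversarial distillation, distribution matching, quantization, and speculative decoding are incorporated. It achieves an inference time of up to 1.8 seconds to generate a 2K image (without using an LLM/VLM as the PE model). Comprehensive evaluation shows that Seedream 4.0 achieves state-of-the-art results in both T2I and multimodal image editing. In particular, it shows exceptional multimodal capability on complex tasks, including precise image editing and in-context reasoning, supports multi-image reference, and can generate multiple output images. This extends conventional T2I systems into a more interactive, multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is available at https://www.volcengine.com/experience/ark?launch=seedream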

Related papers