Fugu-MT Paper Translation (Summary): Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Paper summary: Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

arxiv url: http://arxiv.org/abs/2604.19858v2
Date: Thu, 23 Apr 2026 15:33:31 GMT
Status: Translation complete
System update date: 2026-04-27 13:34:22.029534
Title: Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
Authors: Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, Jun Dan, Kai Zhu, Kang Zhao, Keyu Yan, Minghui Chen, Pandeng Li, Shuangle Chen, Tong Shen, Yu Liu, Yue Jiang, Yulin Pan, Yuxiang Tuo, Zeyinzi Jiang, Zhen Han, Ang Wang, Bang Zhang, Baole Ai, Bin Wen, Boang Feng, Feiwu Yu, Gang Wang, Haiming Zhao, He Kang, Jianjing Xiang, Jianyuan Zeng, Jinkai Wang, Junjie Zhou, Ke Sun, Linqian Wu, Pei Gong, Pingyu Wu, Ruiwen Wu, Tongtong Su, Wenmeng Zhou, Wenting Shen, Wenyuan Yu, Xianjun Xu, Xiaoming Huang, Xiejie Shen, Xin Xu, Yan Kou, Yangyu Lv, Yifan Zhai, Yitong Huang, Yun Zheng, Yuntao Hong, Zhe Zhang, Zhicheng Zhang
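Abstract summary: Wan-Image is a unified visual generation system designed to shift image generation models from casual synthesizers into professional-grade productivity tools. It is powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data. According to the authors, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity.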
Reference score (independently computed attention score): 86.08534008471356
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.
