Fugu-MT Paper Translation (Abstract): Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Paper overview: Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

arxiv url: http://arxiv.org/abs/2603.25706v2
Date: Mon, 30 Mar 2026 03:26:27 GMT
Status: Translation complete
Last updated in system: 2026-03-31 13:48:18.829351
Title: Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
Title (reference translation): Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang
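Abstract summary: We introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency.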
Reference score (independently computed attention score): 68.94182767962914
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
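Abstract (reference translation): Recent unified models have made unprecedented progress in both understanding and generation. However, although most of them accept multi-modal inputs, they typically produce outputs in only a single modality. The difficulty of producing interleaved content stems mainly from the scarcity of training data and the challenge of modeling long-range cross-modal context. To address this, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner generates dense textual descriptions of visual content, while the visualizer synthesizes the corresponding images. Following this design, we construct large-scale textual-proxy interleaved data (in which visual content is represented as text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. In addition, integrating diverse understanding and generation data into planner training lets Wan-Weaver achieve robust task reasoning and generation proficiency. To evaluate the model's interleaved generation capability, we also construct a benchmark covering a wide range of use cases across multiple dimensions. Extensive experiments show that, even without access to any real interleaved data, Wan-Weaver outperforms existing methods.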
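
The planner/visualizer split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `TextSegment`, `ImagePlan`, `Image`, `generate_interleaved`, `toy_planner`, and `toy_visualizer` are hypothetical, the "models" are stubs, and passing earlier images as references to later ones is an assumption about how reference-guided generation might be used at inference time.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImagePlan:
    # Dense textual description standing in for an image (the "textual proxy").
    description: str


@dataclass
class Image:
    # Stand-in for pixel data; records what the stub visualizer was given.
    description: str
    num_references: int


Segment = Union[TextSegment, ImagePlan]
Planner = Callable[[str], List[Segment]]
Visualizer = Callable[[str, List[Image]], Image]


def generate_interleaved(
    prompt: str, planner: Planner, visualizer: Visualizer
) -> List[Union[TextSegment, Image]]:
    """Run the planner once, then render each planned image in order."""
    output: List[Union[TextSegment, Image]] = []
    references: List[Image] = []  # assumption: earlier images guide later ones
    for segment in planner(prompt):
        if isinstance(segment, ImagePlan):
            image = visualizer(segment.description, list(references))
            references.append(image)
            output.append(image)
        else:
            output.append(segment)
    return output


def toy_planner(prompt: str) -> List[Segment]:
    # Hypothetical stub: a real planner would be a trained language model.
    return [
        TextSegment(f"Step 1 for: {prompt}"),
        ImagePlan("a red kettle on a stove, steam rising"),
        TextSegment("Step 2: pour the water."),
        ImagePlan("the same red kettle pouring water into a cup"),
    ]


def toy_visualizer(description: str, references: List[Image]) -> Image:
    # Hypothetical stub: a real visualizer would synthesize an image.
    return Image(description=description, num_references=len(references))


result = generate_interleaved("make tea", toy_planner, toy_visualizer)
for item in result:
    print(item)
```

The point of the sketch is the interface: the planner only ever emits text, so it could be trained on textual-proxy data alone, while the visualizer only needs a description plus optional reference images.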
