DuoGen: Towards General Purpose Interleaved Multimodal Generation
- URL: http://arxiv.org/abs/2602.00508v2
- Date: Tue, 03 Feb 2026 02:54:14 GMT
- Title: DuoGen: Towards General Purpose Interleaved Multimodal Generation
- Authors: Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni, Jialuo Li, Shubham Pachori, Zhaoshuo Li, Yogesh Balaji, Haoxiang Wang, Tsung-Yi Lin, Xiao Fu, Yue Zhao, Chieh-Yun Chen, Ming-Yu Liu, Humphrey Shi
- Abstract summary: DuoGen is a general-purpose interleaved generation framework that addresses data curation, architecture design, and evaluation. We build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites with diverse synthetic examples covering everyday scenarios. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns the DiT with it using curated interleaved image-text sequences.
- Score: 65.13479486098419
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at https://research.nvidia.com/labs/dir/duogen/.
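A minimal sketch of the two-stage decoupled strategy described in the abstract, using tiny stand-in networks for the pretrained MLLM and the video-pretrained DiT; the module definitions, connector, loss placeholders, and optimizer settings are all illustrative assumptions, not DuoGen's released implementation:

    import torch
    import torch.nn as nn

    # Stand-in modules; in DuoGen these would be a pretrained multimodal LLM
    # and a diffusion transformer (DiT) pretrained on video generation.
    mllm = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    dit = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    connector = nn.Linear(64, 64)  # hypothetical bridge into the DiT conditioning space

    def stage1_step(batch, opt):
        # Stage 1: instruction-tune the MLLM alone on multimodal conversations.
        hidden = mllm(batch["tokens"])
        loss = hidden.pow(2).mean()  # placeholder for the language-modeling loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    def stage2_step(batch, opt):
        # Stage 2: freeze the tuned MLLM; align the DiT (and connector) with it
        # on interleaved image-text sequences.
        with torch.no_grad():  # the MLLM is not updated in this stage
            cond = mllm(batch["tokens"])
        pred = dit(connector(cond))
        loss = (pred - batch["latents"]).pow(2).mean()  # placeholder diffusion loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    batch = {"tokens": torch.randn(2, 8, 64), "latents": torch.randn(2, 8, 64)}
    opt1 = torch.optim.AdamW(mllm.parameters(), lr=1e-5)
    opt2 = torch.optim.AdamW(list(dit.parameters()) + list(connector.parameters()), lr=1e-5)
    print("stage 1 loss:", stage1_step(batch, opt1))
    print("stage 2 loss:", stage2_step(batch, opt2))

The point the sketch captures is the decoupling: stage 2 backpropagates only into the DiT and the connector, so the instruction-tuned MLLM's understanding is preserved while the generator learns to consume its hidden states.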
Related papers
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models [12.21686773633269]
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Extending this paradigm to vision, however, requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. We propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications.
arXiv Detail & Related papers (2025-09-26T01:43:40Z)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset [140.1967962502411]
We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features. A sequential pretraining strategy for unified models, first training on image understanding and subsequently on image generation, offers practical advantages. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
arXiv Detail & Related papers (2025-05-14T17:11:07Z)
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance (a minimal guidance sketch appears after this list). Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
- Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space [9.327655601475605]
We propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy. Nexus-Gen achieves state-of-the-art performance on evaluation benchmarks spanning image understanding, generation, and editing tasks.
arXiv Detail & Related papers (2025-04-30T06:30:48Z)
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat [51.78489661490396]
We introduce WeGen, a model that unifies multimodal generation and understanding. It can generate diverse results with high creativity for less detailed instructions. We show it achieves state-of-the-art performance across various visual generation benchmarks.
arXiv Detail & Related papers (2025-03-03T02:50:07Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts (a retrieval sketch appears after this list).
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Generative Visual Instruction Tuning [11.727612242016871]
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model.
We produce GenLLaVA, a Generative Large Language and Visual Assistant.
Our model demonstrates visual understanding capabilities superior to LLaVA and achieves competitive results with native multimodal models.
arXiv Detail & Related papers (2024-06-17T07:06:58Z)
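The multi-modal classifier-free guidance referenced in the Mogao entry above can be illustrated with a minimal two-condition sketch. The formulation below follows the common InstructPix2Pix-style split over an image and a text condition; Mogao's exact weights, ordering, and conditioning interface are not specified in the summary, and the denoiser here is a placeholder:

    import torch

    def denoise(x, t, text_cond=None, image_cond=None):
        # Placeholder denoiser; a real model would be the diffusion transformer.
        out = torch.tanh(x) * (t / 1000.0)
        if text_cond is not None:
            out = out + 0.1 * text_cond
        if image_cond is not None:
            out = out + 0.1 * image_cond
        return out

    def multimodal_cfg(x, t, text_cond, image_cond, w_text=5.0, w_image=1.5):
        # Two-condition classifier-free guidance (InstructPix2Pix-style split);
        # the scales w_text and w_image are illustrative defaults.
        e_uncond = denoise(x, t)                        # no conditions
        e_image = denoise(x, t, image_cond=image_cond)  # image condition only
        e_full = denoise(x, t, text_cond=text_cond, image_cond=image_cond)
        return (e_uncond
                + w_image * (e_image - e_uncond)  # pull toward the image condition
                + w_text * (e_full - e_image))    # then toward the text condition

    x = torch.randn(1, 4, 8, 8)
    guided = multimodal_cfg(x, 500, torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8))
    print(guided.shape)  # torch.Size([1, 4, 8, 8])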
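The retrieval pairing referenced in the VIMI entry is sketched below under one plausible reading: nearest-neighbor matching of a text prompt against an embedding bank of candidate in-context examples. The cosine-similarity metric and the data layout are assumptions, not VIMI's actual pipeline:

    import numpy as np

    def retrieve_examples(prompt_emb, bank_embs, bank_items, k=3):
        # Rank bank entries by cosine similarity to the prompt embedding
        # and return the k nearest as in-context examples.
        sims = bank_embs @ prompt_emb / (
            np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [bank_items[i] for i in top]

    bank_embs = np.random.randn(100, 64)  # hypothetical example embeddings
    bank_items = [f"example_{i}" for i in range(100)]
    prompt_emb = np.random.randn(64)
    print(retrieve_examples(prompt_emb, bank_embs, bank_items))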