Related papers: LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

URL: http://arxiv.org/abs/2510.22946v2
Date: Wed, 29 Oct 2025 04:07:45 GMT
Title: LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Authors: Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie,
Abstract summary: We show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding.<n>Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks.<n>By training with only 35B tokens, this approach achieves strong results across multiple benchmarks.
Score: 48.02842078521973
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

Related papers

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing [67.77471070868852]
DeepGen 1.0 is a lightweight 5B unified model for image generation and editing.<n>It is trained on only 50M samples, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench.<n>By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
arXiv Detail & Related papers (2026-02-12T17:44:24Z)
DuoGen: Towards General Purpose Interleaved Multimodal Generation [65.13479486098419]
DuoGen is a general-purpose interleaved generation framework that addresses data curation, architecture design, and evaluation.<n>We build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites.<n>A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences.
arXiv Detail & Related papers (2026-01-31T04:35:15Z)
Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I)<n>We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal kontext'' composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs.<n> Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer [90.72238747690972]
We present Manzano, a simple and scalable unified framework for multimodal large language models.<n>A single vision encoder feeds two adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation.<n>A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels.
arXiv Detail & Related papers (2025-09-19T17:58:00Z)
MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [30.494968865008513]
Recent text-to-image models struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex image generation.<n>We propose MENTOR, a novel framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation.<n>Our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods.
arXiv Detail & Related papers (2025-07-13T10:52:59Z)
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation [81.92275347127833]
A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation.<n>In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture.<n>Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation.
arXiv Detail & Related papers (2025-06-12T06:37:34Z)
Emerging Properties in Unified Multimodal Pretraining [32.856334401494145]
We introduce BAGEL, an open-source foundational model that supports multimodal understanding and generation.<n>BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data.<n>It significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks.
arXiv Detail & Related papers (2025-05-20T17:59:30Z)
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.<n>Mogoo integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.<n>Experiments show that Mogao achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs [26.462946557604177]
EasyGen is designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs) Easygen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilities image generation by training an adapter to align the LLM's text space with the BiDiffuser's image space.
arXiv Detail & Related papers (2023-10-13T08:38:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.