DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
- URL: http://arxiv.org/abs/2602.12205v2
- Date: Fri, 13 Feb 2026 16:15:45 GMT
- Title: DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
- Authors: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang
- Abstract summary: DeepGen 1.0 is a lightweight 5B unified model for image generation and editing. It is trained on only ~50M samples, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
- Score: 67.77471070868852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
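Two components of the abstract lend themselves to concrete illustration. First, a minimal sketch of the Stacked Channel Bridging (SCB) idea: hidden states tapped from several VLM layers are projected to the DiT's width, concatenated with learnable 'think tokens', and fused before conditioning the generative backbone. The dimensions, number of tapped layers, and fusion depth below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Sketch of Stacked Channel Bridging (SCB): hidden states from
    several VLM layers are projected to a shared width, concatenated
    with learnable 'think tokens', and fused by a small transformer
    before conditioning the DiT. All sizes here are assumptions."""

    def __init__(self, vlm_dim=2048, dit_dim=1536, num_layers_tapped=4,
                 num_think_tokens=64):
        super().__init__()
        # One projection per tapped VLM layer (hypothetical tap count).
        self.proj = nn.ModuleList(
            nn.Linear(vlm_dim, dit_dim) for _ in range(num_layers_tapped))
        # Learnable think tokens that carry reasoning-rich guidance.
        self.think_tokens = nn.Parameter(
            torch.randn(1, num_think_tokens, dit_dim) * 0.02)
        # Light fusion block over [think tokens ; stacked layer features].
        fuse_layer = nn.TransformerEncoderLayer(
            d_model=dit_dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(fuse_layer, num_layers=2)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: list of (B, T, vlm_dim), one per tapped layer.
        feats = [p(h) for p, h in zip(self.proj, vlm_hidden_states)]
        stacked = torch.cat(feats, dim=1)               # (B, L*T, dit_dim)
        think = self.think_tokens.expand(stacked.size(0), -1, -1)
        fused = self.fuse(torch.cat([think, stacked], dim=1))
        return fused  # conditioning sequence handed to the DiT

# Toy usage with random activations standing in for VLM hidden states.
bridge = StackedChannelBridge()
hiddens = [torch.randn(2, 77, 2048) for _ in range(4)]
cond = bridge(hiddens)
print(cond.shape)  # torch.Size([2, 372, 1536])  (64 think + 4*77 tokens)
```

Second, a sketch of how MR-GRPO's "mixture of reward functions" could be blended into a group-relative advantage. The linear weighting is an assumption; the abstract only says a mixture of rewards and supervision signals is used.

```python
import torch

def mr_grpo_advantages(rewards: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages over a reward mixture. rewards: (G, K)
    scores from K reward functions for G rollouts of one prompt;
    weights: (K,) mixture weights (the weighting scheme is assumed)."""
    mixed = rewards @ weights                        # (G,) blended reward
    return (mixed - mixed.mean()) / (mixed.std() + 1e-6)
```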
Related papers
- DuoGen: Towards General Purpose Interleaved Multimodal Generation [65.13479486098419]
DuoGen is a general-purpose interleaved generation framework that addresses data curation, architecture design, and evaluation. We build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns the DiT with it using curated interleaved image-text sequences.
arXiv Detail & Related papers (2026-01-31T04:35:15Z)
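As a small illustration of DuoGen's two-stage decoupled strategy, the sketch below freezes one half of the system while tuning the other. The exact freeze/unfreeze split is an assumption about what "decoupled" entails, not a detail taken from the paper.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(mllm: nn.Module, dit: nn.Module, stage: int) -> None:
    """Sketch of a two-stage decoupled recipe: stage 1 instruction-tunes
    the MLLM alone; stage 2 freezes it and aligns the DiT against its
    outputs. The split is assumed, not confirmed by the paper."""
    if stage == 1:          # instruction-tune the MLLM
        set_trainable(mllm, True)
        set_trainable(dit, False)
    elif stage == 2:        # align the DiT with the frozen MLLM
        set_trainable(mllm, False)
        set_trainable(dit, True)
    else:
        raise ValueError("stage must be 1 or 2")
```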
- LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation [48.02842078521973]
We show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. By training with only 35B tokens, this approach achieves strong results across multiple benchmarks.
arXiv Detail & Related papers (2025-10-27T02:59:57Z)
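A minimal sketch of the interleaving LightBagel describes: pretrained blocks are kept frozen while new multimodal self-attention blocks are inserted between them. The block width, head count, and every-second-block placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InterleavedFusionStack(nn.Module):
    """Keep the original (frozen) transformer blocks and interleave new
    trainable multimodal self-attention blocks between them, per the
    LightBagel summary. Sizes and placement are assumptions."""

    def __init__(self, original_blocks, dim=1024, every=2):
        super().__init__()
        self.original = nn.ModuleList(original_blocks)
        for p in self.original.parameters():
            p.requires_grad = False        # retain pretrained blocks as-is
        # New trainable blocks attending over the joint token sequence.
        self.fusion = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                       batch_first=True)
            for _ in range(len(self.original) // every))
        self.every = every

    def forward(self, tokens):
        # tokens: (B, T, dim) joint sequence of text and image tokens.
        f = 0
        for i, block in enumerate(self.original):
            tokens = block(tokens)
            if (i + 1) % self.every == 0 and f < len(self.fusion):
                tokens = self.fusion[f](tokens)   # multimodal fusion pass
                f += 1
        return tokens

# Toy usage with generic encoder layers standing in for pretrained blocks.
blocks = [nn.TransformerEncoderLayer(1024, 8, batch_first=True)
          for _ in range(4)]
stack = InterleavedFusionStack(blocks)
out = stack(torch.randn(2, 16, 1024))   # (2, 16, 1024)
```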
- Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model [28.559525134847828]
We present UniPic2-SD3.5M-Kontext, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data. UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly more generation parameters.
arXiv Detail & Related papers (2025-09-04T17:00:17Z)
- MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models [30.494968865008513]
Recent text-to-image models struggle with precise visual control and balancing multimodal inputs, and they require extensive training for complex image generation. We propose MENTOR, a novel framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. Our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods.
arXiv Detail & Related papers (2025-07-13T10:52:59Z)
- KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model [63.13906424204078]
We propose KaLM-Embedding-V2, a series of versatile and compact embedding models. For model architecture, we implement the models at a compact 0.5B size with simple mean-pooling to produce fixed-length embeddings. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation.
arXiv Detail & Related papers (2025-06-26T01:09:44Z)
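The mean-pooling head mentioned in the KaLM-Embedding-V2 summary is simple enough to show directly. This masked-average sketch is a standard implementation; the padding convention (1 = real token, 0 = padding) is the only assumption.

```python
import torch

def mean_pool(hidden_states: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean-pooling to a fixed-length embedding.
    hidden_states: (B, T, D); attention_mask: (B, T) with 1 for real
    tokens and 0 for padding (assumed convention)."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # (B, 1)
    return summed / counts

# Toy check: padding positions do not affect the embedding.
h = torch.randn(2, 5, 8)
m = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
emb = mean_pool(h, m)
print(emb.shape)  # torch.Size([2, 8])
```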
- Policy Optimized Text-to-Image Pipeline Design [73.9633527029941]
We introduce a novel reinforcement learning-based framework for text-to-image generation. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations. We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z)
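A sketch of the reward-model ensemble described above: several heads score a prompt-workflow pair, and their mean serves as the RL reward. The shared-encoder-plus-linear-heads layout and head count are assumptions; the summary only states that an ensemble predicts quality scores from prompt-workflow combinations.

```python
import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    """Ensemble reward sketch: a shared encoder embeds a prompt-workflow
    pair, several linear heads score it, and the mean score is the
    reward used for policy optimization. Layout is assumed."""

    def __init__(self, encoder: nn.Module, feat_dim=512, n_heads=4):
        super().__init__()
        self.encoder = encoder                  # maps pair -> (B, feat_dim)
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, 1) for _ in range(n_heads))

    def forward(self, pair_features):
        feats = self.encoder(pair_features)
        scores = torch.cat([h(feats) for h in self.heads], dim=-1)  # (B, n)
        return scores.mean(dim=-1)              # ensemble reward, shape (B,)

# Toy usage with a linear layer standing in for the pair encoder.
ensemble = RewardEnsemble(encoder=nn.Linear(128, 512))
reward = ensemble(torch.randn(4, 128))          # (4,)
```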
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset [140.1967962502411]
We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features. A sequential pretraining strategy for unified models, first training on image understanding and subsequently on image generation, offers practical advantages. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
arXiv Detail & Related papers (2025-05-14T17:11:07Z)
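To make "a diffusion transformer that generates CLIP image features" concrete, here is one training-step sketch that diffuses in CLIP-feature space with a rectified-flow-style objective. The interpolation schedule, velocity target, and the `dit(noisy, t, cond)` call signature are assumptions, not BLIP3-o's actual recipe.

```python
import torch
import torch.nn as nn

def clip_feature_flow_loss(dit: nn.Module,
                           clip_feats: torch.Tensor,
                           text_cond: torch.Tensor) -> torch.Tensor:
    """One training step for generating CLIP image features with a DiT.
    Uses a rectified-flow interpolation and velocity target, both
    standard-diffusion assumptions rather than details from the paper.
    clip_feats: (B, D) target CLIP features; text_cond: (B, C)."""
    b = clip_feats.size(0)
    t = torch.rand(b, 1)                          # timestep in [0, 1]
    noise = torch.randn_like(clip_feats)
    noisy = (1 - t) * clip_feats + t * noise      # interpolate clean -> noise
    target = noise - clip_feats                   # velocity of the path
    pred = dit(noisy, t, text_cond)               # assumed DiT call signature
    return nn.functional.mse_loss(pred, target)
```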
- EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs [26.462946557604177]
EasyGen is designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space.
arXiv Detail & Related papers (2023-10-13T08:38:56Z)
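Finally, the two bridges in the EasyGen summary reduce to small trainable maps between feature spaces. The sketch below shows both directions; all dimensions are illustrative assumptions.

```python
import torch.nn as nn

class BiDiffuserToLLM(nn.Module):
    """Projection that maps BiDiffuser image features into the LLM's
    embedding space so the LLM can generate text about an image.
    Dimensions are assumed, not taken from the paper."""
    def __init__(self, img_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(img_dim, llm_dim)

    def forward(self, image_feats):         # (B, T_img, img_dim)
        return self.proj(image_feats)       # (B, T_img, llm_dim)

class LLMToBiDiffuser(nn.Module):
    """Adapter that maps LLM text states into BiDiffuser's conditioning
    space so the diffusion model can render what the LLM describes."""
    def __init__(self, llm_dim=4096, img_cond_dim=768):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(llm_dim, img_cond_dim), nn.GELU(),
            nn.Linear(img_cond_dim, img_cond_dim))

    def forward(self, text_states):         # (B, T_txt, llm_dim)
        return self.adapter(text_states)    # (B, T_txt, img_cond_dim)
```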