Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- URL: http://arxiv.org/abs/2503.21758v1
- Date: Thu, 27 Mar 2025 17:57:07 GMT
- Title: Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
- Authors: Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao
- Abstract summary: Lumina-Image 2.0 is a text-to-image generation framework that achieves significant progress compared to previous work. It adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence. We introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks.
- Score: 76.44331001702379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performances even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
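To make the joint-sequence principle concrete, below is a minimal PyTorch-style sketch of a transformer block that attends over concatenated text and image tokens. All module choices and dimensions are illustrative assumptions, not the actual Unified Next-DiT implementation.

```python
import torch
import torch.nn as nn

class JointSequenceBlock(nn.Module):
    """Illustrative transformer block over a joint text+image sequence."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_tokens, image_tokens):
        # Concatenate both modalities into one sequence so a single
        # self-attention layer models all cross-modal interactions.
        x = torch.cat([text_tokens, image_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        # Split back so each modality can feed its own head or loss.
        n_text = text_tokens.shape[1]
        return x[:, :n_text], x[:, n_text:]

# Usage: 16 text tokens and 64 image patches share one sequence.
block = JointSequenceBlock()
text = torch.randn(2, 16, 512)
image = torch.randn(2, 64, 512)
text_out, image_out = block(text, image)
```

Because both modalities share a single attention operation rather than separate cross-attention modules, appending further token types to the sequence is straightforward, which is consistent with the abstract's claim of seamless task expansion.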
Related papers
- VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning [40.75264235359017]
We present VARGPT-v1.1, an advanced unified visual autoregressive model.
The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis.
It achieves state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks.
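The next-scale paradigm can be sketched as a coarse-to-fine loop; `model` here is a hypothetical callable that refines an upsampled token map at each resolution, and the scale schedule is invented for illustration, so VARGPT-v1.1's actual generation procedure differs in detail.

```python
import torch
import torch.nn.functional as F

def next_scale_generation(model, start, scales=(1, 2, 4, 8, 16)):
    # `start` is the coarsest token map, e.g. shape (B, C, 1, 1).
    canvas = start
    for s in scales[1:]:
        # Condition on everything generated so far, upsampled to the
        # resolution of the next scale.
        context = F.interpolate(canvas, size=(s, s), mode="nearest")
        # Predict the entire refined (s x s) map in one shot, rather
        # than emitting image tokens one position at a time.
        canvas = model(context)
    return canvas
```

Understanding-side inputs, by contrast, are handled with ordinary next-token prediction over a flat token sequence, which is the other half of the dual paradigm described above.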
arXiv Detail & Related papers (2025-04-03T18:06:28Z)
- ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning [89.19449553099747]
We study the problem of Text-to-Image In-Context Learning (T2I-ICL). We propose a framework that incorporates a thought process called ImageGen-CoT prior to image generation. We then fine-tune MLLMs on a curated dataset of such reasoning chains to enhance their contextual reasoning capabilities.
arXiv Detail & Related papers (2025-03-25T03:18:46Z)
- MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis [18.876109299162138]
We introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE).
This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component.
MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks.
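A few lines of PyTorch suffice to sketch the freeze-text / tune-visual recipe; the `visual` parameter-name prefix is an assumption made for illustration, not MARS's real module layout.

```python
import torch.nn as nn

def freeze_text_tune_visual(model: nn.Module):
    # Freeze the pre-trained language pathway; leave only the visual
    # pathway trainable ("visual" name prefix is assumed here).
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("visual")
    # Hand only the trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```

Only the returned parameter list goes to the optimizer, which is what keeps the language prior intact while the visual pathway adapts.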
arXiv Detail & Related papers (2024-07-10T12:52:49Z)
- Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT [120.39362661689333]
We present an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency.
Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities.
arXiv Detail & Related papers (2024-06-05T17:53:26Z)
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieves SOTA results in text-rich image perception tasks and significantly improves performance on comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
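A toy vector-quantization sketch of the tokenizer idea: discretizing image features into codebook indices lets one LLM vocabulary both read images ("SEE") and emit them ("Draw"). The codebook size and dimensions are invented; SEED's actual tokenizer is considerably more elaborate.

```python
import torch
import torch.nn as nn

class DiscreteImageTokenizer(nn.Module):
    def __init__(self, codebook_size: int = 8192, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, patches, dim) -> nearest-codebook token ids,
        # i.e. discrete "words" an LLM can read alongside text.
        dists = torch.cdist(feats, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)

    def decode(self, ids: torch.Tensor) -> torch.Tensor:
        # Token ids emitted by the LLM map back to continuous
        # features for an image decoder ("Draw").
        return self.codebook(ids)
```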
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)
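A hedged sketch of the mutual-translation training step, in which the discriminator doubles as an image captioner. All module interfaces and the `caption_loss` helper are hypothetical placeholders, and the adversarial real/fake term is omitted for brevity.

```python
import torch

def time_training_step(G, D, images, captions, opt_g, opt_d, caption_loss):
    # --- Captioner/discriminator update on real pairs: D must
    #     describe real images with the ground-truth captions. ---
    opt_d.zero_grad()
    d_loss = caption_loss(D(images), captions)
    d_loss.backward()
    opt_d.step()

    # --- Generator update: an image synthesized from text should be
    #     captioned back into that same text by D. ---
    opt_g.zero_grad()
    fake = G(captions)
    g_loss = caption_loss(D(fake), captions)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```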