Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
- URL: http://arxiv.org/abs/2507.17801v1
- Date: Wed, 23 Jul 2025 17:42:13 GMT
- Title: Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
- Authors: Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yu Qiao, Peng Gao
- Abstract summary: Lumina-mGPT 2.0 is a stand-alone, decoder-only autoregressive model. It is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models.
- Score: 80.30976039119236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks (including subject-driven generation, image editing, controllable synthesis, and dense prediction) within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.
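For readers unfamiliar with the decoder-only, next-token formulation the abstract describes, the sketch below illustrates the basic idea: text and image tokens share one unified vocabulary, and an image is produced by sampling its tokens one at a time from a causal transformer. The tiny model, the vocabulary split, the 8x8 token grid, and the image-only sampling restriction are illustrative assumptions for exposition, not the released Lumina-mGPT 2.0 architecture, and the efficiency techniques mentioned above (inference-time scaling, speculative Jacobi sampling) are omitted.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 1000, 4096        # assumed sizes, for illustration only
VOCAB = TEXT_VOCAB + IMAGE_VOCAB            # one unified vocabulary for both modalities
GRID = 8                                    # 8 x 8 = 64 image tokens per sample (toy value)

class TinyDecoder(nn.Module):
    """Stand-in for a decoder-only transformer trained from scratch."""
    def __init__(self, d=128, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        layer = nn.TransformerEncoderLayer(d, heads, dim_feedforward=4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):               # tokens: [B, T] integer ids
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                  # [B, T, VOCAB] next-token logits

@torch.no_grad()
def generate_image_tokens(model, prompt_tokens, temperature=1.0):
    """Sample GRID*GRID image tokens autoregressively after a tokenized text prompt."""
    seq = prompt_tokens
    for _ in range(GRID * GRID):
        logits = model(seq)[:, -1] / temperature
        logits[:, :TEXT_VOCAB] = float("-inf")          # image positions draw only image tokens
        nxt = torch.multinomial(logits.softmax(dim=-1), 1)
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, prompt_tokens.shape[1]:]              # image token ids, ready for de-tokenization

model = TinyDecoder()
prompt = torch.randint(0, TEXT_VOCAB, (1, 16))          # pretend-tokenized text prompt
print(generate_image_tokens(model, prompt).shape)       # torch.Size([1, 64])
```

In a real system the returned image token ids would be mapped back to pixels by the model's image tokenizer; the same sequence interface is what lets conditioning signals (prompts, control images, subject images) and outputs live in a single generative framework.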
Related papers
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation [66.73899356886652]
We build an image tokenizer directly atop pre-trained vision foundation models. Our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on ImageNet benchmarks.
arXiv Detail & Related papers (2025-07-11T09:32:45Z)
- DanceGRPO: Unleashing GRPO on Visual Generation [36.36813831536346]
This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization to visual generation paradigms. We show consistent and substantial improvements, which outperform baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.
arXiv Detail & Related papers (2025-05-12T17:59:34Z)
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework [76.44331001702379]
Lumina-Image 2.0 is a text-to-image generation framework that achieves significant progress compared to previous work. It adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence. We introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks.
arXiv Detail & Related papers (2025-03-27T17:57:07Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [49.04935506942202]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance when scaled properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z)
- Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT [120.39362661689333]
We present an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency.
Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities.
arXiv Detail & Related papers (2024-06-05T17:53:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.