BLIP3o-NEXT: Next Frontier of Native Image Generation
- URL: http://arxiv.org/abs/2510.15857v1
- Date: Fri, 17 Oct 2025 17:50:58 GMT
- Title: BLIP3o-NEXT: Next Frontier of Native Image Generation
- Authors: Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu,
- Abstract summary: We present BLIP3o, a fully open foundation model in the BLIP3 series that advances the next frontier of native image generation.<n>BLIP3o unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities.
- Score: 113.25832679864631
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
Related papers
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [118.52589065972795]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities.<n>Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z) - BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset [140.1967962502411]
We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features.<n>A sequential pretraining strategy for unified models-first training on image understanding and subsequently on image generation offers practical advantages.<n>Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
arXiv Detail & Related papers (2025-05-14T17:11:07Z) - Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.<n>Mogoo integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.<n>Experiments show that Mogao achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z) - Boosting Generative Image Modeling via Joint Image-Feature Synthesis [15.133906625258797]
We introduce a novel generative image modeling framework that seamlessly bridges the gap by leveraging a diffusion model to jointly model low-level image latents.<n>Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise.<n>By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z) - Illustrious: an Open Advanced Illustration Model [7.428509329724737]
We develop a text-to-image anime image generative model, called Illustrious, to achieve high resolution, dynamic color range images, and high restoration ability.
We focus on three critical approaches for model improvement. First, we delve into the significance of the batch size and dropout control, which enables faster learning of controllable token based concept activations.
Second, we increase the training resolution of images, affecting the accurate depiction of character anatomy in much higher resolution, extending its generation capability over 20MP with proper methods.
arXiv Detail & Related papers (2024-09-30T04:59:12Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion [50.59261592343479]
We present Kandinsky1, a novel exploration of latent diffusion architecture.
The proposed model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - RenAIssance: A Survey into AI Text-to-Image Generation in the Era of
Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noises with repeating steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.