LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation
- URL: http://arxiv.org/abs/2508.04732v1
- Date: Tue, 05 Aug 2025 20:53:43 GMT
- Title: LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation
- Authors: Xiaoqi Dong, Xiangyu Zhou, Nicholas Evans, Yujia Lin
- Abstract summary: Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following.
LumiGen is a novel LVLM-enhanced iterative framework designed to elevate T2I model performance.
LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines.
- Score: 1.124958340749622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image (T2I) generation has made significant advancements with diffusion models, yet challenges persist in handling complex instructions, ensuring fine-grained content control, and maintaining deep semantic consistency. Existing T2I models often struggle with tasks like accurate text rendering, precise pose generation, or intricate compositional coherence. Concurrently, Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. We propose LumiGen, a novel LVLM-enhanced iterative framework designed to elevate T2I model performance, particularly in areas requiring fine-grained control, through a closed-loop, LVLM-driven feedback mechanism. LumiGen comprises an Intelligent Prompt Parsing & Augmentation (IPPA) module for proactive prompt enhancement and an Iterative Visual Feedback & Refinement (IVFR) module, which acts as a "visual critic" to iteratively correct and optimize generated images. Evaluated on the challenging LongBench-T2I Benchmark, LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines. Notably, our framework demonstrates significant improvements in critical dimensions such as text rendering and pose expression, validating the effectiveness of LVLM integration for more controllable and higher-quality image generation.
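The abstract describes LumiGen as a closed loop: IPPA augments the user prompt up front, a T2I backbone renders an image, and IVFR acts as an LVLM "visual critic" whose feedback drives further refinement rounds. The excerpt does not expose any interfaces, so the Python sketch below only illustrates that control flow; the callables `augment_prompt`, `generate_image`, and `critique_image`, the scoring scale, and the stopping rule are all hypothetical stand-ins rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

Image = Any  # stand-in for whatever the T2I backend returns (e.g. a PIL image)

@dataclass
class Critique:
    score: float   # e.g. a 0-5 quality rating from the LVLM critic (hypothetical scale)
    feedback: str  # textual corrections (rendered text, pose, layout, ...)

def lumigen_style_loop(
    user_prompt: str,
    augment_prompt: Callable[[str], str],              # IPPA stand-in: LVLM expands/structures the prompt
    generate_image: Callable[[str], Image],            # T2I backbone stand-in (e.g. a diffusion pipeline)
    critique_image: Callable[[Image, str], Critique],  # IVFR stand-in: LVLM acting as "visual critic"
    max_rounds: int = 4,
    target_score: float = 4.0,
) -> tuple[Image, float]:
    """Closed-loop generate -> critique -> refine cycle (hypothetical interface)."""
    prompt = augment_prompt(user_prompt)               # proactive prompt enhancement (IPPA)
    best_image, best_score = None, float("-inf")

    for _ in range(max_rounds):
        image = generate_image(prompt)
        critique = critique_image(image, user_prompt)  # judge against the original instruction
        if critique.score > best_score:
            best_image, best_score = image, critique.score
        if critique.score >= target_score:             # good enough, stop early
            break
        # Fold the critic's corrections back into the prompt for the next round (IVFR).
        prompt = f"{prompt}\nCorrections from visual critic: {critique.feedback}"

    return best_image, best_score
```

In this sketch the critic judges each image against the original user prompt rather than the augmented one, so that prompt augmentation cannot drift away from the user's intent; that is a design choice of the illustration, not something the abstract specifies.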
Related papers
- LVLM-Composer's Explicit Planning for Image Generation [0.0]
We introduce LVLM-Composer, a novel 10-billion-parameter-scale LVLM specifically engineered for enhanced compositional image synthesis.
Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation.
Experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions.
arXiv Detail & Related papers (2025-07-05T20:21:03Z) - Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation [42.78181795494584]
Hi-SSLVLM is a generative model designed to significantly advance text-to-image synthesis.
Hi-SSLVLM addresses limitations through a unique two-stage self-supervised learning strategy.
Experiments demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics.
arXiv Detail & Related papers (2025-07-05T20:16:32Z) - Reinforcing Multimodal Understanding and Generation with Dual Self-rewards [56.08202047680044]
Large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework.
Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks.
We introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning [47.592351387052545]
GoT-R1 is a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation.
We propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and the final output.
Experimental results demonstrate significant improvements on the T2I-CompBench benchmark.
arXiv Detail & Related papers (2025-05-22T17:59:58Z) - ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities within a single model: understanding, generation, and editing.
ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics.
We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z) - Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z) - Modality-Specialized Synergizers for Interleaved Vision-Language Generalists [45.800383191637785]
Vision-Language Generalists (VLGs) are capable of understanding and generating both text and images.
One primary limitation lies in applying a unified architecture and the same set of parameters to simultaneously model discrete text tokens and continuous image features.
Recent works attempt to tackle this fundamental problem by introducing modality-aware expert models.
We introduce MODALITY-SPECIALIZED SYNERGIZERS (MOSS), a novel design that efficiently optimizes existing unified architectures of VLGs.
arXiv Detail & Related papers (2024-07-04T03:28:22Z) - Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT [120.39362661689333]
We present an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency.
Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities.
arXiv Detail & Related papers (2024-06-05T17:53:26Z) - DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z) - LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)