LVLM-Composer's Explicit Planning for Image Generation
- URL: http://arxiv.org/abs/2507.04152v1
- Date: Sat, 05 Jul 2025 20:21:03 GMT
- Title: LVLM-Composer's Explicit Planning for Image Generation
- Authors: Spencer Ramsey, Jeffrey Lee, Amina Grant
- Abstract summary: We introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. Experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines. An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.
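The abstract describes a Hierarchical Semantic Planning Module that decomposes a complex prompt into structured scene elements before generation, but the paper's implementation is not reproduced here. The minimal Python sketch below only illustrates what such a structured prompt decomposition could look like; the ObjectSpec and ScenePlan dataclasses, the plan_prompt heuristic, and the example prompt are illustrative assumptions, not LVLM-Composer's actual code.

```python
# Illustrative sketch only: the abstract names a "Hierarchical Semantic Planning
# Module" for structured prompt decomposition but does not publish its code.
# Every name below (ObjectSpec, ScenePlan, plan_prompt) is an assumption made
# for exposition, not LVLM-Composer's implementation.
from dataclasses import dataclass, field
from typing import List

RELATION_CUES = (" to the left of ", " to the right of ", " above ", " below ")

@dataclass
class ObjectSpec:
    name: str                                             # head noun of the clause
    attributes: List[str] = field(default_factory=list)   # colors, styles, poses

@dataclass
class ScenePlan:
    objects: List[ObjectSpec]
    relations: List[str]   # raw spatial clauses, e.g. "the car to the left of the dog"

def plan_prompt(prompt: str) -> ScenePlan:
    """Toy prompt decomposition: split a compound prompt into per-object
    specs and spatial-relation clauses that a generator could consume."""
    objects: List[ObjectSpec] = []
    relations: List[str] = []
    for clause in prompt.split(","):
        clause = clause.strip()
        if any(cue in clause for cue in RELATION_CUES):
            relations.append(clause)
        elif clause:
            words = clause.split()
            # naive heuristic: treat the last word as the object, the rest as attributes
            objects.append(ObjectSpec(name=words[-1], attributes=words[:-1]))
    return ScenePlan(objects=objects, relations=relations)

if __name__ == "__main__":
    plan = plan_prompt(
        "a red vintage car, a sleeping golden retriever, the car to the left of the dog"
    )
    for obj in plan.objects:
        print(obj)
    print(plan.relations)
```

In the system the abstract describes, such a decomposition would presumably be produced by the LVLM itself and the resulting plan would condition image generation; the sketch only conveys the shape of the intermediate representation.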
Related papers
- LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation [1.124958340749622]
Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. LumiGen is a novel LVLM-enhanced iterative framework designed to elevate T2I model performance. LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-05T20:53:43Z)
- Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation [42.78181795494584]
Hi-SSLVLM is a generative model designed to significantly advance text-to-image synthesis. It addresses limitations through a unique two-stage self-supervised learning strategy. Experiments demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics.
arXiv Detail & Related papers (2025-07-05T20:16:32Z)
- MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens. We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z)
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning [47.592351387052545]
GoT-R1 is a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. We propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark.
arXiv Detail & Related papers (2025-05-22T17:59:58Z)
- CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback [58.27353205269664]
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships.
arXiv Detail & Related papers (2025-05-16T12:23:58Z)
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities. We propose a novel framework based on multimodal retrieval-augmented generation (RAG) that introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z)
- Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models [19.054780489639793]
This paper introduces Progressive multi-granular Vision-Language alignments (PromViL). Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning.
arXiv Detail & Related papers (2024-12-11T06:21:33Z)
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models [85.10375181040436]
We propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating Vision-Language Models.
We find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons.
arXiv Detail & Related papers (2024-10-13T05:35:09Z)
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships.
We develop the REVISION framework which improves spatial fidelity in vision-language models.
Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
- In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
Compositional image understanding remains a difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model [108.42241250772643]
We introduce InternLM-XComposer2, a vision-language model excelling in free-form text-image composition and comprehension.
This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs.
Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content.
arXiv Detail & Related papers (2024-01-29T18:59:02Z)
- Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.