Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation
- URL: http://arxiv.org/abs/2507.04151v1
- Date: Sat, 05 Jul 2025 20:16:32 GMT
- Title: Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation
- Authors: Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young
- Abstract summary: Hi-SSLVLM is a novel generative model designed to significantly advance text-to-image synthesis. It addresses the limitations of prior methods through a unique two-stage self-supervised learning strategy. Experiments demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics.
- Score: 42.78181795494584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM's enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.
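As a rough illustration of the second stage described in the abstract, the sketch below mocks up the Internal Compositional Planning (ICP) step and one plausible form of the Semantic Consistency Loss. All names here (ToyLVLM, plan_subprompts, encode_text, encode_image, guided_generation_step) and the cosine-similarity formulation of the loss are assumptions made for illustration only; the paper does not publish code or this interface.

```python
import torch
import torch.nn.functional as F


def semantic_consistency_loss(image_emb, subprompt_embs):
    # One plausible (assumed) form of the Semantic Consistency Loss: penalize
    # low cosine similarity between the generated-image embedding and each
    # planned sub-prompt embedding.
    sims = F.cosine_similarity(image_emb.unsqueeze(0), subprompt_embs, dim=-1)
    return (1.0 - sims).mean()


class ToyLVLM:
    """Stand-in for the LVLM backbone grounded in Stage 1; it returns random
    embeddings so the control flow runs end to end."""
    dim = 512

    def plan_subprompts(self, prompt):
        # Internal Compositional Planning (ICP): decompose the prompt into
        # detailed textual sub-prompts (toy decomposition).
        return [f"{prompt} -- object layout",
                f"{prompt} -- attributes and colors",
                f"{prompt} -- spatial relations"]

    def encode_text(self, text):
        return torch.randn(self.dim)

    def encode_image(self, image):
        return torch.randn(self.dim)


def guided_generation_step(lvlm, generate_fn, prompt):
    # Stage 2: plan sub-prompts, condition the generator on them, then score
    # the output against the plan with the consistency loss.
    subprompts = lvlm.plan_subprompts(prompt)
    image = generate_fn(prompt, subprompts)
    img_emb = lvlm.encode_image(image)
    sub_embs = torch.stack([lvlm.encode_text(s) for s in subprompts])
    return image, semantic_consistency_loss(img_emb, sub_embs)


if __name__ == "__main__":
    lvlm = ToyLVLM()
    dummy_generator = lambda prompt, subprompts: torch.zeros(3, 64, 64)
    image, loss = guided_generation_step(
        lvlm, dummy_generator, "a red cube to the left of a blue sphere")
    print(f"semantic consistency loss: {loss.item():.3f}")
```

In a real system the random embeddings would come from the grounded LVLM backbone and the dummy generator would be the paper's guided image generator; the toy version only demonstrates the plan-generate-score control flow, not the authors' actual method.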
Related papers
- LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation [1.124958340749622]
Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. LumiGen is a novel LVLM-enhanced iterative framework designed to elevate T2I model performance. LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-05T20:53:43Z) - LVLM-Composer's Explicit Planning for Image Generation [0.0]
We introduce LVLM-Composer, a novel 10-billion-parameter-scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. Experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions.
arXiv Detail & Related papers (2025-07-05T20:21:03Z) - CoMemo: LVLMs Need Image Context with Image Memory [51.681858871027345]
CoMemo is a dual-path architecture that combines a Context image path with an image Memory path for visual processing. We introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness.
arXiv Detail & Related papers (2025-06-06T17:59:06Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models [0.7366405857677226]
The Vision-Language Aligned Diffusion (VLAD) model is a generative framework that addresses challenges through a dual-stream strategy. VLAD decomposes textual prompts into global and local representations, ensuring precise alignment with visual features. It incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images.
arXiv Detail & Related papers (2025-01-01T18:27:13Z) - ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance [47.53085562765585]
We introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model. To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer. To promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme.
arXiv Detail & Related papers (2024-12-09T17:11:50Z) - STAR: Scale-wise Text-conditioned AutoRegressive image generation [38.98271279816512]
We introduce STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. STAR enables text-driven image generation up to 1024×1024 through three key designs.
arXiv Detail & Related papers (2024-06-16T03:45:45Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context that involves multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model [108.42241250772643]
We introduce InternLM-XComposer2, a vision-language model excelling in free-form text-image composition and comprehension.
This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs.
Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content.
arXiv Detail & Related papers (2024-01-29T18:59:02Z) - LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z) - Planting a SEED of Vision in Large Language Model [73.17530130368053]
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the ability to SEE and Draw at the same time.
This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs.
arXiv Detail & Related papers (2023-07-16T13:41:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.