Related papers: Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

URL: http://arxiv.org/abs/2501.00917v1
Date: Wed, 01 Jan 2025 18:27:13 GMT
Title: Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models
Authors: Emily Johnson, Noah Wilson,
Abstract summary: Vision-Language Aligned Diffusion (VLAD) model is a generative framework that addresses challenges through a dual-stream strategy.<n>VLAD decomposes textual prompts into global and local representations, ensuring precise alignment with visual features.<n>It incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images.
Score: 0.7366405857677226
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses these challenges through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD utilizes a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, ensuring precise alignment with visual features. Furthermore, it incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments conducted on MARIO-Eval and INNOVATOR-Eval benchmarks demonstrate that VLAD significantly outperforms state-of-the-art methods in terms of image quality, semantic alignment, and text rendering accuracy. Human evaluations further validate the superior performance of VLAD, making it a promising approach for text-to-image generation in complex scenarios.

Related papers

Visual Semantic Description Generation with MLLMs for Image-Text Matching [7.246705430021142]
We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantics.<n>Our approach combines: (1) Instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) Prototype-level alignment through VSD clustering to ensure category-level consistency.
arXiv Detail & Related papers (2025-07-11T13:38:01Z)
EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [31.31018600797305]
We propose a prompt inversion technique called sys for text-to-image diffusion models.<n>Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z)
Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models [0.0]
We present a novel method for improving text-to-image generation by combining Large Language Models with diffusion models. Our approach incorporates semantic understanding from pre-trained LLMs to guide the generation process. Our method significantly improves both the visual quality and alignment of generated images with text descriptions.
arXiv Detail & Related papers (2025-02-02T15:43:13Z)
Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild. The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
Zero-shot Text-guided Infinite Image Synthesis with LLM guidance [2.531998650341267]
There is a lack of text-image paired datasets with high-resolution and contextual diversity.<n>Expanding images based on text requires global coherence and rich local context understanding.<n>We propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding.
arXiv Detail & Related papers (2024-07-17T15:10:01Z)
Modality-Specialized Synergizers for Interleaved Vision-Language Generalists [45.800383191637785]
Vision-Language Generalists (VLGs) capable of understanding and generating both text and images.<n>One primary limitation lies in applying a unified architecture and the same set of parameters to simultaneously model discrete text tokens and continuous image features.<n>Recent works attempt to tackle this fundamental problem by introducing modality-aware expert models.<n>We introduce MODALITY-SPECIALIZED SYNERGIZERS (MOSS), a novel design that efficiently optimize existing unified architectures of VLGs.
arXiv Detail & Related papers (2024-07-04T03:28:22Z)
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions. Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image. We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample. We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework. Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture. Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)
Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts. We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff. In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story. Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities. We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.