Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment
for Markup-to-Image Generation
- URL: http://arxiv.org/abs/2308.01147v1
- Date: Wed, 2 Aug 2023 13:43:03 GMT
- Title: Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation
- Authors: Guojin Zhong, Jin Yuan, Pan Wang, Kailun Yang, Weili Guan, Zhiyong Li
- Abstract summary: This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM)
FSA-CDM introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation.
Experiments are conducted on four benchmark datasets from different domains.
- Score: 15.411325887412413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently emerging task of markup-to-image generation poses greater
challenges than natural image generation, owing to its low tolerance for errors
and the complex sequence and context correlations between markup and
rendered images. This paper proposes a novel model named "Contrast-augmented
Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which
introduces contrastive positive/negative samples into the diffusion model to
boost performance for markup-to-image generation. Technically, we design a
fine-grained cross-modal alignment module to thoroughly explore the sequence
similarity between the two modalities for learning robust feature
representations. To improve the generalization ability, we propose a
contrast-augmented diffusion model to explicitly explore positive and negative
samples by maximizing a novel contrastive variational objective, which is
mathematically inferred to provide a tighter bound for the model's
optimization. Moreover, a context-aware cross-attention module is developed
to capture the contextual information within markup language during the
denoising process, yielding better noise prediction results. Extensive
experiments are conducted on four benchmark datasets from different domains,
and the experimental results demonstrate the effectiveness of the proposed
components in FSA-CDM, which exceeds state-of-the-art performance by
roughly 2%-12% in DTW. The code will be released at
https://github.com/zgj77/FSACDM.
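The abstract's central idea, pairing the usual denoising objective on a matched markup/image pair with a contrastive term on a mismatched (negative) pair, can be sketched in miniature. This is a hedged illustration only, not the authors' FSA-CDM implementation: the function names, the hinge-with-margin form, and the margin value are all hypothetical stand-ins for the paper's contrastive variational objective.

```python
# Toy sketch of a contrastive-augmented diffusion loss: the positive
# (matched markup/image) pair should predict the injected noise well,
# while the negative (mismatched) pair is pushed to predict it poorly.
import numpy as np

def denoise_loss(eps_pred, eps_true):
    """Standard diffusion objective: MSE between predicted and true noise."""
    return float(np.mean((eps_pred - eps_true) ** 2))

def contrastive_diffusion_loss(eps_pred_pos, eps_pred_neg, eps_true, margin=1.0):
    """Minimize the positive pair's noise-prediction error; penalize the
    negative pair only when its error falls below the margin, discouraging
    the model from denoising mismatched markup well."""
    pos = denoise_loss(eps_pred_pos, eps_true)
    neg = denoise_loss(eps_pred_neg, eps_true)
    return pos + max(0.0, margin - neg)

rng = np.random.default_rng(0)
eps_true = rng.standard_normal(16)
good = eps_true + 0.1 * rng.standard_normal(16)  # accurate prediction (positive pair)
bad = rng.standard_normal(16)                    # unrelated prediction (negative pair)
loss = contrastive_diffusion_loss(good, bad, eps_true)
print(round(loss, 4))
```

In the paper this contrastive term is derived as a variational bound rather than a hinge, so the sketch should be read as the shape of the idea, not its exact objective.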
Related papers
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation [0.40792653193642503]
We identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models.
We propose a semantic approach, using a pairwise mean CLIP score as our semantic consistency score.
arXiv Detail & Related papers (2024-04-12T20:16:03Z)
- Provably Robust Score-Based Diffusion Posterior Sampling for Plug-and-Play Image Reconstruction [31.503662384666274]
In science and engineering, the goal is to infer an unknown image from a small number of measurements collected via a known forward model describing a certain imaging modality.
Motivated by their empirical success, score-based diffusion models have emerged as an impressive candidate for an expressive prior in image reconstruction.
arXiv Detail & Related papers (2024-03-25T15:58:26Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation [96.11061713135385]
This work presents a new score-decomposed diffusion model to explicitly optimize the tangled distributions during image generation.
We equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold.
SDDM outperforms existing SBDM-based methods with much fewer diffusion steps on several I2I benchmarks.
arXiv Detail & Related papers (2023-08-04T06:21:57Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- Hierarchical Integration Diffusion Model for Realistic Image Deblurring [71.76410266003917]
Diffusion models (DMs) have been introduced in image deblurring and exhibited promising performance.
We propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring.
Experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T12:18:20Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), each have drawbacks: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto standard of Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
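One of the related papers above proposes a pairwise mean CLIP score as a consistency metric for diffusion-model outputs. A minimal sketch of that idea follows: the average cosine similarity over all pairs of image embeddings generated from the same prompt. Real CLIP embeddings are replaced here by arbitrary vectors, and the helper names are hypothetical; this is an illustration of the metric's shape, not that paper's code.

```python
# Pairwise mean consistency: average cosine similarity over all unordered
# pairs of embeddings of images generated from the same prompt.
import itertools
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_mean_consistency(embeddings):
    """Mean cosine similarity over all unordered pairs of embeddings."""
    pairs = list(itertools.combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Identical embeddings -> perfectly consistent generations.
e = np.ones(4)
print(pairwise_mean_consistency([e, e, e]))  # -> 1.0
```

A score near 1.0 indicates the model renders the same prompt to near-identical embeddings; lower scores indicate more variable generations.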
This list is automatically generated from the titles and abstracts of the papers in this site.