ImageBART: Bidirectional Context with Multinomial Diffusion for
Autoregressive Image Synthesis
- URL: http://arxiv.org/abs/2108.08827v1
- Date: Thu, 19 Aug 2021 17:50:07 GMT
- Authors: Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer
- Abstract summary: Autoregressive models incorporate context in a linear 1D order by attending only to previously synthesized image patches above or to the left.
We propose a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process.
Our approach can take unrestricted, user-provided masks into account to perform local image editing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive models and their sequential factorization of the data
likelihood have recently demonstrated great potential for image representation
and synthesis. Nevertheless, they incorporate image context in a linear 1D
order by attending only to previously synthesized image patches above or to the
left. This unidirectional, sequential bias of attention is unnatural for
images, as it disregards large parts of a scene until synthesis is almost
complete. It also processes the entire image on a single scale, thus ignoring
more global contextual information up to the gist of the entire scene. As a
remedy we incorporate a coarse-to-fine hierarchy of context by combining the
autoregressive formulation with a multinomial diffusion process: Whereas a
multistage diffusion process successively removes information to coarsen an
image, we train a (short) Markov chain to invert this process. In each stage,
the resulting autoregressive ImageBART model progressively incorporates context
from previous stages in a coarse-to-fine manner. Experiments show greatly
improved image modification capabilities over autoregressive models while also
providing high-fidelity image generation, both of which are enabled through
efficient training in a compressed latent space. Specifically, our approach can
take unrestricted, user-provided masks into account to perform local image
editing. Thus, in contrast to pure autoregressive models, it can solve
free-form image inpainting and, in the case of conditional models, local,
text-guided image modification without requiring mask-specific training.
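To make the forward coarsening process concrete, here is a minimal, illustrative sketch of one multinomial diffusion step on discrete codebook tokens. This is an assumption-laden toy (uniform resampling with probability `beta_t`, a flattened 16x16 token grid, codebook size 1024), not the paper's exact parameterization; it only shows how a short chain of such steps progressively destroys information, which the learned reverse chain would then invert stage by stage.

```python
import random

def multinomial_diffusion_step(tokens, beta_t, num_classes, rng):
    """One illustrative forward step of a multinomial diffusion:
    each discrete token is resampled uniformly from the codebook
    with probability beta_t, and kept unchanged otherwise."""
    return [rng.randrange(num_classes) if rng.random() < beta_t else t
            for t in tokens]

rng = random.Random(0)
# A flattened 16x16 grid of hypothetical VQ codebook indices.
x0 = [rng.randrange(1024) for _ in range(256)]

# A short chain with increasing noise levels coarsens the "image";
# ImageBART-style models learn one reverse stage per noise level.
xt = x0
for beta_t in (0.1, 0.2, 0.4):
    xt = multinomial_diffusion_step(xt, beta_t, num_classes=1024, rng=rng)

survived = sum(a == b for a, b in zip(x0, xt)) / len(x0)
print(f"fraction of original tokens intact: {survived:.2f}")
```

Because each stage only has to undo a small amount of corruption while conditioning on the (bidirectionally visible) coarser stage, the chain can stay short, which is what makes training in a compressed latent space efficient.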
Related papers
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR achieves superior performance compared with other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new state-of-the-art text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
arXiv Detail & Related papers (2024-01-04T01:10:56Z) - Improving Diffusion-based Image Translation using Asymmetric Gradient
Guidance [51.188396199083336]
We present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance.
Our model's adaptability allows it to be implemented with both image- and latent-diffusion models.
Experiments show that our method outperforms various state-of-the-art models in image translation tasks.
arXiv Detail & Related papers (2023-06-07T12:56:56Z) - High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Recurrent Affine Transformation for Text-to-image Synthesis [5.256132101498471]
Existing methods usually adaptively fuse suitable text information into the synthesis process with isolated fusion blocks.
We propose a Recurrent Affine Transformation (RAT) for Generative Adversarial Networks that connects all the fusion blocks with a recurrent neural network to model their long-term dependency.
Being aware of matching image regions, text descriptions supervise the generator to synthesize more relevant image contents.
arXiv Detail & Related papers (2022-04-22T03:49:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.