Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
- URL: http://arxiv.org/abs/2211.10950v1
- Date: Sun, 20 Nov 2022 11:22:24 GMT
- Title: Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
- Authors: Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, Wenhu Chen
- Abstract summary: We propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images.
This is the first work to successfully leverage diffusion models for coherent visual story synthesis.
- Score: 33.69732363040526
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Conditioned diffusion models have demonstrated state-of-the-art text-to-image
synthesis capacity. Recently, most works have focused on synthesizing independent
images, whereas for real-world applications it is common and necessary to
generate a series of coherent images for storytelling. In this work, we
mainly focus on story visualization and continuation tasks and propose AR-LDM,
a latent diffusion model auto-regressively conditioned on history captions and
generated images. Moreover, AR-LDM can generalize to new characters through
adaptation. To the best of our knowledge, this is the first work to successfully
leverage diffusion models for coherent visual story synthesis.
Quantitative results show that AR-LDM achieves state-of-the-art FID scores on PororoSV,
FlintstonesSV, and VIST, a newly introduced and challenging dataset of natural
images. Large-scale human evaluations show that AR-LDM has superior
performance in terms of quality, relevance, and consistency.
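As a rough illustration of the auto-regressive conditioning described above, the following toy PyTorch sketch generates each story frame from noise while conditioning on the current caption together with features of all previously generated (caption, frame) pairs. All module sizes, the pooled history encoder, and the crude denoising loop are illustrative assumptions only; they are not the paper's actual architecture, which builds on a full latent diffusion backbone with learned text and image encoders.

```python
# Minimal sketch of auto-regressive conditioning for story visualization,
# in the spirit of AR-LDM. Every module below is an illustrative toy,
# not the paper's implementation.
import torch
import torch.nn as nn

LATENT_DIM, COND_DIM, T_STEPS = 64, 128, 50

class ToyDenoiser(nn.Module):
    """Predicts the noise to remove, given a noisy latent and a pooled history condition."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + COND_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, z_t, cond, t):
        t_feat = torch.full((z_t.shape[0], 1), float(t) / T_STEPS)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

class ToyARStoryModel(nn.Module):
    """Toy stand-in for an auto-regressively conditioned latent diffusion model."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.caption_enc = nn.EmbeddingBag(vocab_size, COND_DIM)  # stand-in text encoder
        self.frame_enc = nn.Linear(LATENT_DIM, COND_DIM)          # stand-in image encoder
        self.denoiser = ToyDenoiser()

    @torch.no_grad()
    def generate_story(self, captions):
        """captions: list of 1-D LongTensors of token ids, one per story frame."""
        history, frames = [], []
        for cap in captions:
            # Condition on the current caption plus ALL previous (caption, frame) features.
            feats = [self.caption_enc(cap.unsqueeze(0))] + history
            cond = torch.stack(feats, dim=0).mean(dim=0)  # pooled multimodal history
            z = torch.randn(1, LATENT_DIM)                # start the frame from pure noise
            for t in reversed(range(T_STEPS)):            # crude DDPM-style reverse loop
                eps = self.denoiser(z, cond, t)
                z = z - eps / T_STEPS                     # toy denoising update
            frames.append(z)                              # a real LDM would decode z with a VAE
            # Append this frame's caption and image features to the history for later frames.
            history += [self.caption_enc(cap.unsqueeze(0)), self.frame_enc(z)]
        return frames

model = ToyARStoryModel()
captions = [torch.randint(0, 1000, (5,)) for _ in range(4)]  # 4 frames, 5 tokens each
story = model.generate_story(captions)
print(len(story), story[0].shape)  # -> 4 torch.Size([1, 64])
```

The essential point is the growing history list: the denoising of every new frame is conditioned on all earlier captions and generated frames, which is what encourages cross-frame consistency in the story.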
Related papers
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models [6.668241588219693]
Visual storytelling is increasingly desired beyond real-world imagery.
Current techniques, which typically use autoregressive decoders, suffer from low inference speed and are not well-suited for synthetic scenes.
We propose a novel diffusion-based system DiffuVST, which models a series of visual descriptions as a single conditional denoising process.
arXiv Detail & Related papers (2023-12-12T08:40:38Z)
- Multi-View Unsupervised Image Generation with Cross Attention Guidance [23.07929124170851]
This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset based on the visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
arXiv Detail & Related papers (2023-12-07T14:55:13Z)
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder that distills a source view into a compact representation, which in turn guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
- Improved Visual Story Generation with Adaptive Context Modeling [39.04249009170821]
We present a simple method that improves the leading system with adaptive context modeling.
We evaluate our model on PororoSV and FlintstonesSV datasets and show that our approach achieves state-of-the-art FID scores on both story visualization and continuation scenarios.
arXiv Detail & Related papers (2023-05-26T10:43:42Z)
- Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes.
We show that MCDiff achieves state-of-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
- Consistent View Synthesis with Pose-Guided Diffusion Models [51.37925069307313]
Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications.
We propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image.
arXiv Detail & Related papers (2023-03-30T17:59:22Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop StoryDALL-E, a model for story continuation in which the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose DiVAE, a VQ-VAE architecture with a diffusion decoder, to serve as the reconstruction component in image synthesis.
Our model achieves state-of-the-art results and, in particular, generates more photorealistic images.
arXiv Detail & Related papers (2022-06-01T10:39:12Z)
- High-Resolution Image Synthesis with Latent Diffusion Models [14.786952412297808]
Training diffusion models in the latent space of pretrained autoencoders allows, for the first time, a near-optimal trade-off between complexity reduction and detail preservation.
Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks; a minimal latent-space sketch follows the list below.
arXiv Detail & Related papers (2021-12-20T18:55:25Z)
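The latent-diffusion recipe summarized in the last entry (compress images with a pretrained autoencoder, then run the diffusion process on the compact latents) can be sketched in a few lines. Everything below, including the tiny linear autoencoder, the linear noise schedule, and the MLP denoiser, is a hypothetical toy rather than the actual LDM implementation; it only illustrates where the autoencoder sits in a training step.

```python
# Toy sketch of latent diffusion: noise prediction is learned on autoencoder
# latents instead of raw pixels. All components are illustrative assumptions.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Stand-in for the pretrained autoencoder mapping images to a compact latent space."""
    def __init__(self, img_dim=3 * 32 * 32, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, img_dim), nn.Unflatten(1, (3, 32, 32)))

    def encode(self, x):  # pixels -> compact latent (complexity reduction)
        return self.enc(x)

    def decode(self, z):  # latent -> pixels (detail reconstruction)
        return self.dec(z)

def latent_diffusion_training_step(ae, denoiser, images, T=1000):
    """One toy noise-prediction training step, carried out in latent space."""
    with torch.no_grad():
        z0 = ae.encode(images)                        # diffuse in latent, not pixel, space
    t = torch.randint(0, T, (z0.shape[0],))
    alpha = 1.0 - t.float().unsqueeze(1) / T          # toy linear noise schedule
    noise = torch.randn_like(z0)
    z_t = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise
    pred = denoiser(torch.cat([z_t, alpha], dim=1))   # predict the injected noise
    return nn.functional.mse_loss(pred, noise)

ae = TinyAutoencoder()
denoiser = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))
loss = latent_diffusion_training_step(ae, denoiser, torch.randn(8, 3, 32, 32))
print(loss.item())
```

In this setup the autoencoder handles perceptual detail while the diffusion model only has to learn the distribution of small latents, which is the complexity/detail trade-off the entry refers to.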