AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion
- URL: http://arxiv.org/abs/2305.04001v2
- Date: Tue, 23 May 2023 06:59:30 GMT
- Title: AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion
- Authors: Seungwoo Lee, Chaerin Kong, Donghyeon Jeon, Nojun Kwak
- Abstract summary: We introduce a novel T2V framework that additionally employs audio signals to control the temporal dynamics.
We propose audio-based regional editing and signal smoothing to strike a good balance between the two contradicting desiderata of video synthesis.
- Score: 27.47320496383661
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in diffusion models have showcased promising results in the
text-to-video (T2V) synthesis task. However, as these T2V models solely employ
text as the guidance, they tend to struggle in modeling detailed temporal
dynamics. In this paper, we introduce a novel T2V framework that additionally
employs audio signals to control the temporal dynamics, empowering an
off-the-shelf T2I diffusion model to generate audio-aligned videos. We propose
audio-based regional editing and signal smoothing to strike a good balance
between the two contradicting desiderata of video synthesis, i.e., temporal
flexibility and coherence. We empirically demonstrate the effectiveness of our
method through experiments, and further present practical applications for
content creation.
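The abstract does not spell out how the audio signal becomes a temporal control, so the following is a minimal sketch of one plausible reading: per-frame audio magnitudes are smoothed with a moving average (for coherence) and normalized into per-frame editing strengths for an off-the-shelf T2I model (for flexibility). All function names and parameters are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): derive per-frame editing
# strengths from an audio waveform by smoothing its magnitude envelope.
import numpy as np

def audio_to_edit_strengths(waveform: np.ndarray, sr: int, fps: int,
                            smooth_frames: int = 5) -> np.ndarray:
    """Map an audio waveform to one editing strength per video frame.

    Assumed pipeline: per-frame RMS magnitude -> moving-average smoothing
    (temporal coherence) -> normalization to [0, 1] (temporal flexibility).
    """
    hop = sr // fps                      # audio samples per video frame
    n_frames = len(waveform) // hop
    rms = np.array([
        np.sqrt(np.mean(waveform[i * hop:(i + 1) * hop] ** 2))
        for i in range(n_frames)
    ])
    # Smooth the envelope so consecutive frames do not change abruptly.
    kernel = np.ones(smooth_frames) / smooth_frames
    smoothed = np.convolve(rms, kernel, mode="same")
    # Normalize to [0, 1]; these values could modulate how strongly an
    # off-the-shelf T2I diffusion model edits the audio-related region.
    rng = smoothed.max() - smoothed.min()
    return (smoothed - smoothed.min()) / (rng + 1e-8)

# Example: a 2-second synthetic tone at 16 kHz rendered at 8 fps.
if __name__ == "__main__":
    sr, fps = 16000, 8
    t = np.linspace(0, 2, 2 * sr, endpoint=False)
    wav = np.sin(2 * np.pi * 440 * t) * np.clip(np.sin(2 * np.pi * 0.5 * t), 0, None)
    print(audio_to_edit_strengths(wav, sr, fps).round(2))
```

A larger smoothing window favors coherence at the cost of responsiveness to the audio; the paper frames exactly this trade-off between its two desiderata.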
Related papers
- Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce T2AV-Bench, a benchmark for Text-to-Audio generation aligned with videos.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, further strengthened by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z) - Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation [13.626626326590086]
We introduce Auffusion, a Text-to-Audio (TTA) system that adapts Text-to-Image (T2I) diffusion model frameworks to the TTA task.
Our evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources.
Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions.
arXiv Detail & Related papers (2024-01-02T05:42:14Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, which increases the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z) - Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts an input sound into a sound token, like an ordinary word, which can be plugged into existing Text-to-Image (T2I) models in a plug-and-play manner.
Our proposed AAI outperforms other state-of-the-art text- and sound-guided methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
However, they often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)