Text-to-Audio Generation Synchronized with Videos
- URL: http://arxiv.org/abs/2403.07938v1
- Date: Fri, 8 Mar 2024 22:27:38 GMT
- Title: Text-to-Audio Generation Synchronized with Videos
- Authors: Shentong Mo, Jing Shi, Yapeng Tian
- Abstract summary: We introduce T2AV-Bench, a benchmark for Text-to-Audio generation that aligns with videos.
We also present a simple yet effective video-aligned TTA generation model, named T2AV.
It employs a temporal multi-head attention transformer to extract temporal cues from video data, complemented by our Audio-Visual ControlNet.
- Score: 44.848393652233796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-audio (TTA) generation, which synthesizes audio from textual
descriptions, has recently attracted intense research interest. However, most
existing methods, though leveraging latent diffusion models to learn the
correlation between audio and text embeddings, fall short of maintaining
synchronization between the generated audio and the accompanying video, often
producing discernible audio-visual mismatches. To bridge this gap, we introduce
a benchmark for Text-to-Audio generation that aligns with videos, named
T2AV-Bench, which provides three new metrics for evaluating visual alignment
and temporal consistency. To complement it, we also present a simple yet
effective video-aligned TTA generation model, named T2AV. Moving beyond
traditional methods, T2AV extends the latent diffusion approach by conditioning
it on visual-aligned text embeddings. It employs a temporal multi-head
attention transformer to extract temporal cues from video data, and an
Audio-Visual ControlNet that merges these temporal visual representations with
text embeddings. A contrastive learning objective further ensures that the
visual-aligned text embeddings stay close to the audio features. Extensive
evaluations on AudioCaps and T2AV-Bench demonstrate that T2AV sets a new
standard for video-aligned TTA generation in visual alignment and temporal
consistency.
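The abstract describes this conditioning path only at a high level. The PyTorch sketch below shows one plausible reading of it: temporal multi-head self-attention over per-frame video features, a cross-attention fusion standing in for the Audio-Visual ControlNet, and an InfoNCE-style form of the contrastive objective. All dimensions, module shapes, and the exact fusion and loss forms are assumptions for illustration, not the paper's published implementation.

```python
# Hypothetical sketch of T2AV's conditioning path; shapes and the fusion
# rule are assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalVideoEncoder(nn.Module):
    """Temporal multi-head self-attention over per-frame video features."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -> temporally mixed tokens
        return self.encoder(frame_feats)

class VisualTextFusion(nn.Module):
    """Cross-attention from text tokens to temporal visual tokens; a
    stand-in for the Audio-Visual ControlNet fusion, whose exact design
    the abstract does not specify."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, visual_tokens):
        # text_emb: (batch, tokens, dim); visual_tokens: (batch, frames, dim)
        fused, _ = self.attn(text_emb, visual_tokens, visual_tokens)
        return self.norm(text_emb + fused)   # visual-aligned text embeddings

def contrastive_loss(text_emb, audio_emb, temperature: float = 0.07):
    """InfoNCE-style objective (an assumed form of the paper's contrastive
    loss) pulling visual-aligned text embeddings toward matched audio."""
    t = F.normalize(text_emb.mean(dim=1), dim=-1)   # pool text tokens
    a = F.normalize(audio_emb, dim=-1)              # (batch, dim)
    logits = t @ a.t() / temperature                # pairwise similarities
    labels = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, labels)
```

In a full system, the fused output would replace the plain text condition of the latent diffusion model; the diffusion backbone itself is omitted here.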
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation [13.626626326590086]
We introduce Auffusion, a Text-to-Audio (TTA) system that adapts Text-to-Image (T2I) diffusion frameworks to the TTA task.
Our evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources.
Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions.
arXiv Detail & Related papers (2024-01-02T05:42:14Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets, demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can be plugged into existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other state-of-the-art text- and sound-guided methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a contrastive language-image pretraining (CLIP) model; a minimal sketch of this conditioning swap appears after this list.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment [30.38594416942543]
We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
arXiv Detail & Related papers (2023-05-22T10:37:27Z)
- AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion [27.47320496383661]
We introduce a novel T2V framework that additionally employs audio signals to control the temporal dynamics.
We propose audio-based regional editing and signal smoothing to strike a good balance between the two conflicting desiderata of video synthesis.
arXiv Detail & Related papers (2023-05-06T10:26:56Z)
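Among the related papers, CLIPSonic's visual bridge is concrete enough to sketch: train an audio diffusion model conditioned on CLIP image embeddings of video frames, then condition on CLIP text embeddings at inference, exploiting CLIP's shared embedding space. In the snippet below, `diffusion.loss` and `diffusion.sample` are hypothetical stand-ins for an audio diffusion model's training and sampling interfaces; only the CLIP calls are real API.

```python
# Minimal sketch of CLIPSonic's modality bridge. The `diffusion` object and
# its loss/sample methods are hypothetical; the CLIP usage is the real API.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def train_step(diffusion, frame_image, audio_target):
    # Training: condition on the CLIP *image* embedding of a frame taken
    # from the video paired with the target audio (no text labels needed).
    with torch.no_grad():
        cond = model.encode_image(frame_image)       # (1, 512)
    return diffusion.loss(audio_target, cond=cond)   # hypothetical API

def generate_from_text(diffusion, prompt: str):
    # Inference: swap in the CLIP *text* embedding; the shared CLIP space
    # lets it stand in for the image embeddings seen during training.
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        cond = model.encode_text(tokens)             # (1, 512)
    return diffusion.sample(cond=cond)               # hypothetical API
```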
This list is automatically generated from the titles and abstracts of the papers on this site.