DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment
- URL: http://arxiv.org/abs/2305.12903v1
- Date: Mon, 22 May 2023 10:37:27 GMT
- Title: DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment
- Authors: Shentong Mo, Jing Shi, Yapeng Tian
- Abstract summary: We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
- Score: 30.38594416942543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-audio (TTA) generation is a recent popular problem that aims to
synthesize general audio given text descriptions. Previous methods utilized
latent diffusion models to learn audio embedding in a latent space with text
embedding as the condition. However, they ignored the synchronization between
audio and visual content in the video, and tended to generate audio mismatching
from video frames. In this work, we propose a novel and personalized
text-to-sound generation approach with visual alignment based on latent
diffusion models, namely DiffAVA, that can simply fine-tune lightweight
visual-text alignment modules with frozen modality-specific encoders to update
visual-aligned text embeddings as the condition. Specifically, our DiffAVA
leverages a multi-head attention transformer to aggregate temporal information
from video features, and a dual multi-modal residual network to fuse temporal
visual representations with text embeddings. Then, a contrastive learning
objective is applied to match visual-aligned text embeddings with audio
features. Experimental results on the AudioCaps dataset demonstrate that the
proposed DiffAVA can achieve competitive performance on visual-aligned
text-to-audio generation.
Related papers
- Tell What You Hear From What You See -- Video to Audio Generation Through Text [17.95017332858846]
VATT is a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and optional textual description of the audio.
VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions.
arXiv Detail & Related papers (2024-11-08T16:29:07Z) - Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Sounding Video Generator: A Unified Framework for Text-guided Sounding
Video Generation [24.403772976932487]
Sounding Video Generator (SVG) is a unified framework for generating realistic videos along with audio signals.
VQGAN transforms visual frames and audio melspectrograms into discrete tokens.
Transformer-based decoder is used to model associations between texts, visual frames, and audio signals.
arXiv Detail & Related papers (2023-03-29T09:07:31Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AaudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart, on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.