Connecting the Dots between Audio and Text without Parallel Data through
Visual Knowledge Transfer
- URL: http://arxiv.org/abs/2112.08995v1
- Date: Thu, 16 Dec 2021 16:22:10 GMT
- Title: Connecting the Dots between Audio and Text without Parallel Data through
Visual Knowledge Transfer
- Authors: Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers,
Yejin Choi
- Abstract summary: VIP-ANT induces Audio-Text alignment without using parallel audio-text data.
Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
- Score: 40.85506152074302
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machines that can represent and describe environmental soundscapes have
practical potential, e.g., for audio tagging and captioning systems. Prevailing
learning paradigms rely on parallel audio-text data, which is, however, scarce
on the web. We propose VIP-ANT, which induces Audio-Text alignment without
using any parallel audio-text data. Our key idea is to share the image modality
between bi-modal image-text representations and bi-modal image-audio
representations; the image modality functions as a pivot and implicitly
connects audio and text in a tri-modal embedding space.
In a difficult zero-shot setting with no paired audio-text data, our model
demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio
classification tasks, and even surpasses the supervised state of the art for
Clotho caption retrieval (with audio queries) by 2.2% R@1. We further
investigate cases of minimal audio-text supervision, finding that, e.g., just a
few hundred supervised audio-text pairs increase the zero-shot audio
classification accuracy by 8% on US8K. However, to match human parity on some
zero-shot tasks, our empirical scaling experiments suggest that we would need
about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new
avenues for learning audio-text connections with little to no parallel
audio-text data.
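At inference time, the pivot idea reduces to comparing audio and text embeddings that were each aligned to the same image space, without any audio-text training pairs. The following is a minimal sketch of that zero-shot classification step, assuming a CLIP-style text encoder and an audio encoder trained against matching image embeddings; only precomputed embeddings are used, and the names and dimensions are illustrative rather than the paper's released code.

```python
# Minimal sketch: zero-shot audio classification in a shared, image-pivoted space.
# Assumes audio_emb comes from an audio encoder trained to match image embeddings,
# and label_embs come from the corresponding text encoder (e.g., CLIP-style prompts).
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(audio_emb: np.ndarray, label_embs: np.ndarray) -> np.ndarray:
    """audio_emb: (n_clips, d), label_embs: (n_labels, d); returns best label per clip."""
    sims = l2_normalize(audio_emb) @ l2_normalize(label_embs).T  # cosine similarities
    return sims.argmax(axis=1)

# Toy usage with random stand-in embeddings (d=512 is a typical CLIP dimension).
rng = np.random.default_rng(0)
print(zero_shot_classify(rng.standard_normal((4, 512)), rng.standard_normal((50, 512))))
```

Because both encoders target the same image space, cosine similarity between an audio clip and a label prompt is meaningful even though no audio-text pair is observed during training.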
Related papers
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline (a generic sketch of the three desiderata follows this entry).
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
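As a rough illustration of the three desiderata above, the hypothetical reranker below combines pluggable fluency, faithfulness, and audibility scorers with fixed weights; the scorers, weights, and the reranking formulation are illustrative assumptions, not the paper's actual zero-shot audibility-guidance mechanism.

```python
# Hypothetical reranker: pick the candidate caption with the best weighted sum of
# fluency, faithfulness, and audibility scores (all scorers are stand-ins).
from typing import Callable, Sequence, Tuple

def rerank_captions(candidates: Sequence[str],
                    fluency: Callable[[str], float],
                    faithfulness: Callable[[str], float],
                    audibility: Callable[[str], float],
                    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> str:
    w_fl, w_fa, w_au = weights
    scored = [(w_fl * fluency(c) + w_fa * faithfulness(c) + w_au * audibility(c), c)
              for c in candidates]
    return max(scored)[1]

# Toy usage with trivial scorers, just to run end to end.
caps = ["a dog barks loudly", "a red car is parked"]
print(rerank_captions(caps, fluency=lambda c: len(c) / 10,
                      faithfulness=lambda c: float("bark" in c),
                      audibility=lambda c: float("bark" in c or "sound" in c)))
```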
- Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval? [0.0]
This work investigates the relationship between Audio-Text Retrieval (ATR) and Automated Audio Captioning (AAC).
For ATR, we propose using the standard Cross-Entropy loss values obtained for any audio/caption pair to rank candidates (a minimal sketch follows this entry).
Experimental results on the Clotho and AudioCaps datasets demonstrate decent recall values using this simple approach.
arXiv Detail & Related papers (2023-08-29T07:53:17Z)
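The retrieval-by-captioning idea can be sketched as follows: score every candidate caption with the mean cross-entropy an existing captioning model assigns to it given the audio, then rank candidates from lowest to highest loss. The `caption_loss` callable and the toy usage below are placeholders for whatever AAC model is available, not the authors' implementation.

```python
# Sketch: rank candidate captions for one audio clip by captioning cross-entropy.
# `caption_loss(audio, caption)` is assumed to return the mean token cross-entropy
# of the caption under a trained audio captioning model (lower = better match).
from typing import Callable, List, Sequence

def retrieve_by_caption_loss(audio: object,
                             candidates: Sequence[str],
                             caption_loss: Callable[[object, str], float]) -> List[str]:
    return sorted(candidates, key=lambda caption: caption_loss(audio, caption))

# Toy usage with a stand-in loss, just to run end to end.
dummy_loss = lambda audio, caption: 0.1 * len(caption)
print(retrieve_by_caption_loss("clip_0.wav",
                               ["rain falls", "a crowd cheers loudly"],
                               dummy_loss))
```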
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment [30.38594416942543]
We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
arXiv Detail & Related papers (2023-05-22T10:37:27Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Audio Retrieval with WavText5K and CLAP Training [8.362098382773265]
We propose a new collection of about five thousand web audio-text pairs that we refer to as WavText5K.
When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets.
Our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective (a generic sketch of such an objective follows this entry).
arXiv Detail & Related papers (2022-09-28T17:39:26Z)
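A common way to realize a contrastive audio-text objective like the one described above is a symmetric, CLIP-style InfoNCE loss over a batch of paired audio and text embeddings. The sketch below shows that generic form; the embedding dimension and temperature are assumed values, and it is not tied to the paper's specific two-audio-encoder setup.

```python
# Generic symmetric contrastive (InfoNCE / CLIP-style) loss over paired embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, d); row i of each comes from the same pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings in place of encoder outputs.
print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```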