Learning Audio-Video Modalities from Image Captions
- URL: http://arxiv.org/abs/2204.00679v1
- Date: Fri, 1 Apr 2022 19:48:18 GMT
- Title: Learning Audio-Video Modalities from Image Captions
- Authors: Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago
Manen, Chen Sun and Cordelia Schmid
- Abstract summary: A major challenge in text-video and text-audio retrieval is the lack of large-scale training data.
We propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips.
- Score: 62.772232865072745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major challenge in text-video and text-audio retrieval is the lack of
large-scale training data. This is unlike image captioning, where datasets are
on the order of millions of samples. To close this gap we propose a new video
mining pipeline which involves transferring captions from image captioning
datasets to video clips with no additional manual effort. Using this pipeline,
we create a new large-scale, weakly labelled audio-video captioning dataset
consisting of millions of paired clips and captions. We show that training a
multimodal transformer-based model on this data achieves competitive
performance on video retrieval and video captioning, matching or even
outperforming HowTo100M pretraining with 20x fewer clips. We also show that our
mined clips are suitable for text-audio pretraining, and achieve
state-of-the-art results for the task of audio retrieval.
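The abstract only outlines the mining pipeline, but its core step (matching caption-bearing images to video frames by visual-embedding similarity, then transferring the caption to a short clip around the match) can be sketched in a few lines. The sketch below is a hypothetical minimal illustration, not the authors' implementation: it assumes precomputed, L2-normalised embeddings from a shared visual encoder, and the similarity threshold and 10-second clip window are placeholder values.

```python
# Minimal sketch of caption transfer by embedding similarity.
# Assumes embeddings were produced elsewhere by a shared visual encoder;
# the threshold and clip length are illustrative, not the paper's settings.
import numpy as np

def mine_clips(image_embs, captions, frame_embs, frame_times,
               sim_threshold=0.8, clip_seconds=10.0):
    """Return weakly labelled (start, end, caption) clips.

    image_embs:  (N, D) L2-normalised embeddings of captioned seed images.
    captions:    N caption strings paired with those images.
    frame_embs:  (M, D) L2-normalised embeddings of sampled video frames.
    frame_times: (M,) timestamps in seconds of the sampled frames.
    """
    sims = image_embs @ frame_embs.T  # cosine similarity, since normalised
    mined = []
    for i, caption in enumerate(captions):
        j = int(np.argmax(sims[i]))   # best-matching video frame
        if sims[i, j] >= sim_threshold:
            t = float(frame_times[j])
            # Cut a short clip centred on the match and attach the
            # image caption as its weak label.
            mined.append((max(0.0, t - clip_seconds / 2),
                          t + clip_seconds / 2,
                          caption))
    return mined
```

Run over a large video corpus, a procedure like this yields the millions of weakly paired clip-caption examples that the text-video and text-audio pretraining described above consumes.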
Related papers
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take longer subtitle passages into account, capturing contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)