Exploring the Role of Audio in Video Captioning
- URL: http://arxiv.org/abs/2306.12559v1
- Date: Wed, 21 Jun 2023 20:54:52 GMT
- Title: Exploring the Role of Audio in Video Captioning
- Authors: Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar,
Heng Wang
- Abstract summary: We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
- Score: 59.679122191706426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent focus in video captioning has been on designing architectures that can
consume both video and text modalities, and using large-scale video datasets
with text transcripts for pre-training, such as HowTo100M. Though these
approaches have achieved significant improvement, the audio modality is often
ignored in video captioning. In this work, we present an audio-visual
framework, which aims to fully exploit the potential of the audio modality for
captioning. Instead of relying on text transcripts extracted via automatic
speech recognition (ASR), we argue that learning with raw audio signals can be
more beneficial, as audio has additional information including acoustic events,
speaker identity, etc. Our contributions are twofold. First, we observe that
the model overspecializes to the audio modality when pre-training with both the
video and audio modalities, since the ground truth (i.e., text transcripts) can
be predicted from audio alone. We propose a Modality Balanced Pre-training
(MBP) loss to mitigate this issue and significantly improve performance on
downstream tasks. Second, we systematically examine different design choices
for the cross-modal module, which can become an information bottleneck and
produce inferior results. We propose new local-global fusion mechanisms to
improve information exchange between audio and video. We demonstrate
significant improvements by leveraging the audio modality on four datasets, and
even outperform the state of the art on some metrics without relying on the
text modality as input.
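Since the abstract does not spell out the MBP loss, the sketch below is only an illustration of the idea, not the paper's actual formulation: it assumes separate audio-only, video-only, and fused captioning branches (hypothetical names) and reweights the unimodal losses so that the easier audio branch does not dominate pre-training.

```python
# Illustrative sketch only: the MBP formulation is not given in the abstract.
import torch
import torch.nn.functional as F


def modality_balanced_loss(logits_av, logits_a, logits_v, targets, tau=1.0):
    """logits_*: (batch, vocab) next-token logits from the fused, audio-only,
    and video-only branches (hypothetical names); targets: (batch,) token ids
    from the ASR transcript used as the pre-training ground truth."""
    loss_av = F.cross_entropy(logits_av, targets, reduction="none")
    loss_a = F.cross_entropy(logits_a, targets, reduction="none")
    loss_v = F.cross_entropy(logits_v, targets, reduction="none")

    # Assumed balancing rule: weight each unimodal branch by how poorly it
    # currently predicts the transcript, so the model keeps learning from
    # video even when audio alone already solves the pre-training task.
    with torch.no_grad():
        weights = torch.softmax(torch.stack([loss_a, loss_v]) / tau, dim=0)
    balanced = weights[0] * loss_a + weights[1] * loss_v
    return (loss_av + balanced).mean()
```

Similarly, the fusion design is not detailed here; a minimal sketch of one plausible local-global fusion block, in which each modality's local tokens cross-attend to a pooled global summary of the other modality, might look like the following (module and argument names are assumptions):

```python
import torch
import torch.nn as nn


class LocalGlobalFusion(nn.Module):
    """Hypothetical local-global fusion block: each modality's local tokens
    cross-attend to a pooled (global) summary of the other modality, so the
    cross-modal exchange is not bottlenecked by a single token."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        # Global summaries via mean pooling over the local tokens: (B, 1, D).
        audio_global = audio_tokens.mean(dim=1, keepdim=True)
        video_global = video_tokens.mean(dim=1, keepdim=True)
        # Local tokens of one modality query the global summary of the other;
        # residual connections preserve the original local information.
        audio_out, _ = self.video_to_audio(audio_tokens, video_global, video_global)
        video_out, _ = self.audio_to_video(video_tokens, audio_global, audio_global)
        return audio_tokens + audio_out, video_tokens + video_out
```

As a quick usage check, `LocalGlobalFusion(512)(torch.randn(2, 100, 512), torch.randn(2, 32, 512))` returns updated audio and video token sequences with the same shapes as the inputs.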
Related papers
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events and music, but speech as well.
Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are beyond the reach of other audio-visual LLMs (av-LLMs).
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model achieves improved results over state-of-the-art baselines despite being pre-trained on only 0.9M data.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of a potentially unlimited number of output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)