Audiovisual SlowFast Networks for Video Recognition
- URL: http://arxiv.org/abs/2001.08740v2
- Date: Mon, 9 Mar 2020 00:50:19 GMT
- Title: Audiovisual SlowFast Networks for Video Recognition
- Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph
Feichtenhofer
- Abstract summary: We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts.
We report results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.
- Score: 140.08143162600354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Audiovisual SlowFast Networks, an architecture for integrated
audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are
deeply integrated with a Faster Audio pathway to model vision and sound in a
unified representation. We fuse audio and visual features at multiple layers,
enabling audio to contribute to the formation of hierarchical audiovisual
concepts. To overcome training difficulties that arise from different learning
dynamics for audio and visual modalities, we introduce DropPathway, which
randomly drops the Audio pathway during training as an effective regularization
technique. Inspired by prior studies in neuroscience, we perform hierarchical
audiovisual synchronization to learn joint audiovisual features. We report
state-of-the-art results on six video action classification and detection
datasets, perform detailed ablation studies, and show the generalization of
AVSlowFast to learn self-supervised audiovisual features. Code will be made
available at: https://github.com/facebookresearch/SlowFast.
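For illustration, below is a minimal PyTorch-style sketch of the two ideas named in the abstract: lateral fusion of audio features into a visual pathway at a given layer, and DropPathway, which randomly drops the Audio pathway during training as regularization. The module layout, tensor shapes, projection choice, and drop rate are assumptions for exposition and are not taken from the official implementation linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVFusionBlock(nn.Module):
    """One audio-to-visual lateral fusion step with DropPathway (assumed layout)."""

    def __init__(self, audio_dim: int, visual_dim: int, drop_prob: float = 0.5):
        super().__init__()
        # 1x1 convolution projects audio channels to the visual channel width.
        self.proj = nn.Conv1d(audio_dim, visual_dim, kernel_size=1, bias=False)
        self.drop_prob = drop_prob

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C_v, T_v), audio: (B, C_a, T_a) -- toy 1-D feature maps.
        # DropPathway: during training, drop the audio contribution with
        # probability drop_prob so the audio pathway's different learning
        # dynamics do not dominate optimization (drop rate is an assumption).
        if self.training and torch.rand(()).item() < self.drop_prob:
            return visual
        a = self.proj(audio)                          # (B, C_v, T_a)
        a = F.interpolate(a, size=visual.shape[-1])   # align the time axis
        return visual + a                             # fuse by summation


# Toy usage with made-up shapes: fuse audio features into one visual pathway.
fusion = AVFusionBlock(audio_dim=128, visual_dim=256, drop_prob=0.5)
visual_feats = torch.randn(2, 256, 32)     # e.g. Slow-pathway features
audio_feats = torch.randn(2, 128, 100)     # Audio-pathway features
fused = fusion(visual_feats, audio_feats)  # (2, 256, 32)
```

In the paper this kind of fusion is applied at multiple layers so that audio contributes to hierarchical audiovisual features; the sketch shows a single fusion point only.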
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information and neglect the audio.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
Our localization model significantly outperforms existing methods, and building on it we achieve comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)