TriBERT: Full-body Human-centric Audio-visual Representation Learning
for Visual Sound Separation
- URL: http://arxiv.org/abs/2110.13412v1
- Date: Tue, 26 Oct 2021 04:50:42 GMT
- Title: TriBERT: Full-body Human-centric Audio-visual Representation Learning
for Visual Sound Separation
- Authors: Tanzila Rahman, Mengyu Yang, Leonid Sigal
- Abstract summary: We introduce TriBERT -- a transformer-based architecture inspired by ViLBERT.
TriBERT enables contextual feature learning across three modalities: vision, pose, and audio.
We show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks.
- Score: 35.93516937521393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of transformer models in language, such as BERT, has
motivated the use of such architectures for multi-modal feature learning and
tasks. However, most multi-modal variants (e.g., ViLBERT) have limited
themselves to visual-linguistic data. Relatively few have explored their use with
audio-visual modalities, and none, to our knowledge, have done so in the
context of granular audio-visual detection or segmentation tasks such as sound
source separation and localization. In this work, we introduce TriBERT -- a
transformer-based architecture, inspired by ViLBERT, which enables contextual
feature learning across three modalities: vision, pose, and audio, with the use
of flexible co-attention. The use of pose keypoints is inspired by recent works
that illustrate that such representations can significantly boost performance
in many audio-visual scenarios where often one or more persons are responsible
for the sound explicitly (e.g., talking) or implicitly (e.g., sound produced as
a function of a human manipulating an object). From a technical perspective, as
part of the TriBERT architecture, we introduce a learned visual tokenization
scheme based on spatial attention and leverage weak supervision to allow
granular cross-modal interactions for visual and pose modalities. Further, we
supplement learning with a sound-source separation loss formulated across all
three streams. We pre-train our model on the large MUSIC21 dataset and
demonstrate improved performance in audio-visual sound source separation on
that dataset as well as other datasets through fine-tuning. In addition, we
show that the learned TriBERT representations are generic and significantly
improve performance on other audio-visual tasks such as cross-modal
audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy.
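The abstract above describes two mechanisms in enough detail to sketch: a learned visual tokenizer driven by spatial attention, and co-attention that lets three token streams (vision, pose, audio) condition on one another. The PyTorch sketch below is a rough illustration of how such a tri-modal co-attention block could be wired, assuming standard transformer components; all module names, token counts, and dimensions are assumptions for illustration and do not come from the TriBERT paper or its code.

```python
# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn


class SpatialAttentionTokenizer(nn.Module):
    """Pools a CNN feature map (B, C, H, W) into K visual tokens using
    learned spatial attention maps, one map per token."""
    def __init__(self, channels: int, num_tokens: int, dim: int):
        super().__init__()
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)  # K attention logits per location
        self.proj = nn.Linear(channels, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        maps = self.attn(feats).flatten(2).softmax(dim=-1)   # (B, K, H*W)
        flat = feats.flatten(2).transpose(1, 2)               # (B, H*W, C)
        tokens = torch.bmm(maps, flat)                        # (B, K, C), attention-weighted pooling
        return self.proj(tokens)                              # (B, K, D)


class TriModalCoAttention(nn.Module):
    """Each stream queries the concatenation of the other two streams,
    generalizing ViLBERT-style two-stream co-attention to three modalities."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        names = ("vision", "pose", "audio")
        self.cross = nn.ModuleDict(
            {m: nn.MultiheadAttention(dim, heads, batch_first=True) for m in names})
        self.ffn = nn.ModuleDict(
            {m: nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                              nn.GELU(), nn.Linear(4 * dim, dim)) for m in names})

    def forward(self, streams: dict) -> dict:
        out = {}
        for name, x in streams.items():
            # Context = tokens of the two other modalities.
            context = torch.cat([v for k, v in streams.items() if k != name], dim=1)
            attn, _ = self.cross[name](query=x, key=context, value=context)
            h = x + attn
            out[name] = h + self.ffn[name](h)
        return out


if __name__ == "__main__":
    frames = torch.randn(2, 2048, 14, 14)            # e.g., CNN conv features per frame
    tokenizer = SpatialAttentionTokenizer(2048, num_tokens=4, dim=256)
    vis = tokenizer(frames)                          # (2, 4, 256) learned visual tokens
    pose = torch.randn(2, 17, 256)                   # e.g., keypoint tokens per person
    audio = torch.randn(2, 64, 256)                  # e.g., spectrogram patch tokens
    fused = TriModalCoAttention()({"vision": vis, "pose": pose, "audio": audio})
    print({k: tuple(v.shape) for k, v in fused.items()})
```

In the full model described above, blocks of this kind would be stacked and trained jointly with a sound-source separation objective defined across the three streams; that loss is not shown here.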
Related papers
- Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization [4.062872727927056]
The goal of Multilingual Visual Answer localization (MVAL) is to locate a video segment that answers a given multilingual question.
Existing methods either focus solely on the visual modality or integrate the visual and subtitle modalities.
We propose a unified Audio-Visual-Textual Span localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations.
arXiv Detail & Related papers (2024-11-05T06:49:14Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)