SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
- URL: http://arxiv.org/abs/2404.05206v1
- Date: Mon, 8 Apr 2024 05:19:28 GMT
- Title: SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
- Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman,
- Abstract summary: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
- Score: 77.55518265996312
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z) - Learnable Irrelevant Modality Dropout for Multimodal Action Recognition
on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z) - OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via
Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z) - AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches just exploit the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z) - Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised
Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.