Egocentric Audio-Visual Object Localization
- URL: http://arxiv.org/abs/2303.13471v1
- Date: Thu, 23 Mar 2023 17:43:11 GMT
- Title: Egocentric Audio-Visual Object Localization
- Authors: Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
- Abstract summary: We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
- Score: 51.434212424829525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans naturally perceive surrounding scenes by unifying sound and sight in a
first-person view. Likewise, machines can move closer to human-level
intelligence by learning from multisensory inputs captured from an egocentric
perspective. In this paper, we explore the challenging egocentric audio-visual
object localization task and observe that 1) egomotion commonly exists in
first-person recordings, even within a short duration; 2) out-of-view sound
components can arise as wearers shift their attention. To address the
first problem, we propose a geometry-aware temporal aggregation module to
handle the egomotion explicitly. The effect of egomotion is mitigated by
estimating the temporal geometry transformation and exploiting it to update
visual representations. Moreover, we propose a cascaded feature enhancement
module to tackle the second issue. It improves cross-modal localization
robustness by disentangling visually-indicated audio representation. During
training, we take advantage of the naturally available audio-visual temporal
synchronization as "free" self-supervision to avoid costly labeling. We
also annotate and create the Epic Sounding Object dataset for evaluation
purposes. Extensive experiments show that our method achieves state-of-the-art
localization performance in egocentric videos and can be generalized to diverse
audio-visual scenes.
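The abstract above describes the two modules only at a high level. As a rough, non-authoritative sketch of the overall idea (not the authors' implementation), the PyTorch-style snippet below warps per-frame visual features into a reference frame with an estimated homography (a simple stand-in for the temporal geometry transformation), aggregates them over time, scores each location against an audio embedding by cosine similarity, and uses batch-wise audio-visual synchronization as a contrastive training signal. Every function name, shape, and the homography assumption itself are illustrative.

```python
# Minimal, illustrative sketch only: a simplified stand-in for the
# geometry-aware temporal aggregation and audio-visual localization
# described above. It is NOT the authors' code; all module names,
# shapes, and the use of a homography are assumptions.
import torch
import torch.nn.functional as F


def warp_to_reference(feat, homography):
    """Warp a visual feature map (B, C, H, W) into the reference frame using
    a per-sample 3x3 homography expressed in normalized [-1, 1] coordinates
    (a simple proxy for compensating egomotion between frames)."""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=feat.device),
        torch.linspace(-1, 1, W, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)       # (H, W, 3)
    grid = grid.reshape(1, H * W, 3)                                # (1, HW, 3)
    warped = torch.matmul(grid, homography.transpose(1, 2))         # (B, HW, 3)
    warped = warped[..., :2] / warped[..., 2:].clamp(min=1e-6)
    return F.grid_sample(feat, warped.reshape(B, H, W, 2), align_corners=True)


def localization_map(visual_feats, homographies, audio_emb):
    """visual_feats: list of T maps (B, C, H, W); homographies: list of T
    (B, 3, 3) transforms to the reference frame; audio_emb: (B, C).
    Returns a (B, H, W) audio-visual localization heatmap."""
    warped = [warp_to_reference(f, h) for f, h in zip(visual_feats, homographies)]
    agg = torch.stack(warped).mean(dim=0)                           # temporal aggregation
    agg = F.normalize(agg, dim=1)
    audio = F.normalize(audio_emb, dim=1)[:, :, None, None]
    return (agg * audio).sum(dim=1)                                 # cosine similarity per location


def sync_contrastive_loss(agg_visual, audio_emb, temperature=0.07):
    """'Free' self-supervision from audio-visual temporal synchronization:
    each audio clip should match its own (synchronized) visual clip better
    than any other clip in the batch (InfoNCE over max-pooled heatmaps)."""
    v = F.normalize(agg_visual, dim=1)                              # (B, C, H, W)
    a = F.normalize(audio_emb, dim=1)                               # (B, C)
    sims = torch.einsum("bchw,kc->bkhw", v, a)                      # all audio-visual pairings
    logits = sims.flatten(2).max(dim=-1).values / temperature       # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

In this sketch the per-frame homographies are assumed to come from some off-the-shelf keypoint-matching or optical-flow estimator; the paper's actual geometry-aware aggregation and cascaded feature enhancement modules are more involved than this averaging-plus-similarity baseline.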
Related papers
- Spherical World-Locking for Audio-Visual Localization in Egocentric Videos [53.658928180166534]
We propose Spherical World-Locking as a general framework for egocentric scene representation.
Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion.
We design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation.
arXiv Detail & Related papers (2024-08-09T22:29:04Z)
- Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation [18.607236792587614]
We propose a visually guided generative adversarial approach for generating stereo audio from mono audio.
A metric to measure the spatial perception of audio is proposed for the first time.
The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics.
arXiv Detail & Related papers (2023-11-13T09:53:14Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
- Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization [13.144367063836597]
We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results.
Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
arXiv Detail & Related papers (2022-01-06T05:40:16Z)
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN; a rough illustrative sketch of this kind of attention fusion follows below.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
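The VAANet entry above names the attention types but not how they are combined. Below is a hypothetical PyTorch sketch of one way spatial, channel-wise, and temporal attention over a visual 3D-CNN stream and temporal attention over an audio 2D-CNN stream could be fused for emotion classification; layer sizes, module names, and the concatenation-based fusion are assumptions rather than the published architecture.

```python
# Hypothetical sketch of attention-based audio-visual fusion; not the
# published VAANet architecture. All dimensions and names are assumptions.
import torch
import torch.nn as nn


class AttentionFusionHead(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, num_classes=8):
        super().__init__()
        self.channel_att = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Sigmoid())
        self.spatial_att = nn.Conv2d(vis_dim, 1, kernel_size=1)
        self.vis_temporal_att = nn.Linear(vis_dim, 1)
        self.aud_temporal_att = nn.Linear(aud_dim, 1)
        self.classifier = nn.Linear(vis_dim + aud_dim, num_classes)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, C, H, W) clip features from a 3D CNN backbone
        # aud_feats: (B, T, D) per-segment features from a 2D CNN on spectrograms
        B, T, C, H, W = vis_feats.shape
        v = vis_feats.reshape(B * T, C, H, W)
        # Spatial attention: weight each location, then pool spatially.
        s = torch.softmax(self.spatial_att(v).flatten(1), dim=1)         # (BT, HW)
        v = (v.flatten(2) * s.unsqueeze(1)).sum(dim=2)                   # (BT, C)
        # Channel attention: gate feature channels.
        v = v * self.channel_att(v)                                      # (BT, C)
        v = v.reshape(B, T, C)
        # Temporal attention on each stream, then pool over time.
        wv = torch.softmax(self.vis_temporal_att(v), dim=1)              # (B, T, 1)
        wa = torch.softmax(self.aud_temporal_att(aud_feats), dim=1)      # (B, T, 1)
        v = (v * wv).sum(dim=1)                                          # (B, C)
        a = (aud_feats * wa).sum(dim=1)                                  # (B, D)
        return self.classifier(torch.cat([v, a], dim=1))                 # (B, num_classes)
```

A call would pass clip-level 3D-CNN feature maps and per-segment audio features with matching temporal length T; the concatenation-plus-linear classifier is only one of several plausible fusion choices.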
This list is automatically generated from the titles and abstracts of the papers in this site.