A proto-object based audiovisual saliency map
- URL: http://arxiv.org/abs/2003.06779v1
- Date: Sun, 15 Mar 2020 08:34:35 GMT
- Title: A proto-object based audiovisual saliency map
- Authors: Sudarshan Ramenahalli
- Abstract summary: We develop a proto-object based audiovisual saliency map (AVSM) for analysis of dynamic natural scenes.
Such an algorithm can be useful in surveillance, robotic navigation, video compression, and related applications.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The natural environment and our interaction with it are essentially
multisensory: we may deploy visual, tactile, and/or auditory senses to perceive,
learn, and interact with our surroundings. Our objective in this study is to develop a
scene analysis algorithm using multisensory information, specifically vision
and audio. We develop a proto-object based audiovisual saliency map (AVSM) for
the analysis of dynamic natural scenes. A specialized audiovisual camera with
$360^\circ$ field of view, capable of locating sound direction, is used to
collect spatiotemporally aligned audiovisual data. We demonstrate that the
performance of the proto-object based audiovisual saliency map in detecting and
localizing salient objects/events is in agreement with human judgment. In
addition, the proto-object based AVSM that we compute as a linear combination
of visual and auditory feature conspicuity maps captures a higher number of
valid salient events compared to unisensory saliency maps. Such an algorithm
can be useful in surveillance, robotic navigation, video compression and
related applications.
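The abstract describes computing the AVSM as a linear combination of visual and auditory feature conspicuity maps. The following is a minimal sketch of that fusion step, not the paper's implementation: the weights and the min-max normalization are assumptions made here for illustration, and the maps are assumed to be spatially aligned over the same panoramic grid.

```python
import numpy as np

def audiovisual_saliency(visual_conspicuity, auditory_conspicuity,
                         w_visual=0.5, w_auditory=0.5):
    """Fuse visual and auditory conspicuity maps into one saliency map.

    Both inputs are 2D arrays defined over the same 360-degree panoramic
    grid. The weights are illustrative assumptions, not values from the paper.
    """
    def normalize(m):
        # Scale each map to [0, 1] so neither modality dominates by range.
        m = m.astype(np.float64)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

    v = normalize(visual_conspicuity)
    a = normalize(auditory_conspicuity)
    return w_visual * v + w_auditory * a

# Usage: maps of shape (height, width) covering the panoramic field of view.
avsm = audiovisual_saliency(np.random.rand(64, 256), np.random.rand(64, 256))
```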
Related papers
- You Only Speak Once to See [24.889319740761827]
Grounding objects in images using visual cues is a well-established approach in computer vision.
We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes.
Experimental results indicate that audio guidance can be effectively applied to object grounding.
arXiv Detail & Related papers (2024-09-27T01:16:15Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
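For a rough sense of the coordinate transformation idea summarized above, here is a small sketch of my own (not the AV-NeRF code): it re-expresses a listener's position and view direction relative to a sound source. The frame convention, function name, and returned quantities are assumptions for illustration.

```python
import numpy as np

def to_source_centric(listener_pos, view_dir, source_pos):
    """Express listener position and view direction relative to a sound source.

    All inputs are 3D vectors in world coordinates. Returns the listener's
    offset from the source, the distance, and the angle between the view
    direction and the direction toward the source (a simple source-centric
    encoding; the actual AV-NeRF parameterization may differ).
    """
    offset = np.asarray(listener_pos, float) - np.asarray(source_pos, float)
    distance = np.linalg.norm(offset)
    to_source = -offset / distance if distance > 0 else np.zeros(3)
    v = np.asarray(view_dir, float)
    v = v / np.linalg.norm(v)
    # Angle between where the listener looks and where the source lies.
    angle = np.arccos(np.clip(np.dot(v, to_source), -1.0, 1.0))
    return offset, distance, angle

offset, dist, angle = to_source_centric([1, 0, 0], [0, 0, 1], [0, 0, 0])
```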
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
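As a loose illustration of the "suppress silent areas" step mentioned above (a sketch under my own assumptions, not the authors' code), class-aware localization maps can be masked wherever an audiovisual correspondence score indicates a region is unlikely to be sounding; the threshold is a hypothetical parameter.

```python
import numpy as np

def suppress_silent_regions(localization_maps, correspondence, threshold=0.5):
    """Mask class-aware localization maps with audiovisual correspondence.

    localization_maps: array of shape (num_classes, H, W), one map per class.
    correspondence: array of shape (H, W) with values in [0, 1], where higher
    means stronger audio-visual agreement. The threshold is an assumption.
    """
    sounding_mask = (correspondence >= threshold).astype(np.float32)
    # Zero out responses in regions judged silent by the correspondence score.
    return localization_maps * sounding_mask[None, :, :]

maps = np.random.rand(3, 32, 32)   # e.g. 3 sounding-object classes
corr = np.random.rand(32, 32)      # toy correspondence scores
filtered = suppress_silent_regions(maps, corr)
```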
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
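To make the mono-to-binaural goal above concrete, here is a toy sketch (not the paper's geometry-aware model): a predicted left/right difference signal, which in a learned system would come from a network conditioned on visual and geometric cues, is combined with the mono mixture to produce two channels. The sum/difference convention used here is an assumption.

```python
import numpy as np

def mono_to_binaural(mono, predicted_difference):
    """Combine a mono mixture with a predicted left/right difference signal.

    Assumes mono = (left + right) / 2 and difference = (left - right) / 2,
    so left = mono + difference and right = mono - difference. In a learned
    system the difference would be predicted from visual features; here it
    is simply an input array.
    """
    mono = np.asarray(mono, float)
    diff = np.asarray(predicted_difference, float)
    return mono + diff, mono - diff

left, right = mono_to_binaural(np.zeros(16000), np.zeros(16000))
```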
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Bio-Inspired Audio-Visual Cues Integration for Visual Attention Prediction [15.679379904130908]
Visual Attention Prediction (VAP) methods simulate the human selective attention mechanism to perceive the scene.
A bio-inspired audio-visual cues integration method is proposed for the VAP task, which explores the audio modality to better predict the visual attention map.
Experiments are conducted on six challenging audiovisual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD.
arXiv Detail & Related papers (2021-09-17T06:49:43Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition [61.54648991466747]
We explore an audiovisual aerial scene recognition task using both images and sounds as input.
We show the benefit of exploiting the audio information for the aerial scene recognition.
arXiv Detail & Related papers (2020-05-18T04:14:16Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
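The training scheme sketched above can be summarized, loosely and with assumed components, as a reconstruction objective: an audio-conditioned generator animates a still frame and is penalized by the distance between the generated and real video of the same speech segment. A hypothetical skeleton of that loss (the generator here is a stand-in, not the authors' model):

```python
import numpy as np

def reconstruction_loss(generated_video, real_video):
    """Mean absolute error between generated and real video frames.

    Both arrays have shape (num_frames, H, W, C). A real implementation
    would likely add perceptual or adversarial terms; this is only the core
    'make the generated video close to the real one' idea.
    """
    return float(np.mean(np.abs(generated_video - real_video)))

def toy_generator(still_image, audio_clip, num_frames=8):
    # Stand-in: repeat the still image; a learned generator would animate it
    # according to the audio clip.
    return np.repeat(still_image[None, ...], num_frames, axis=0)

still = np.random.rand(64, 64, 3)
audio = np.random.rand(16000)
real = np.random.rand(8, 64, 64, 3)
loss = reconstruction_loss(toy_generator(still, audio), real)
```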
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.