Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
- URL: http://arxiv.org/abs/2201.01928v1
- Date: Thu, 6 Jan 2022 05:40:16 GMT
- Title: Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
- Authors: Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu
- Abstract summary: We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results.
Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
- Score: 13.144367063836597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Augmented reality devices have the potential to enhance human perception and
enable other assistive functionalities in complex conversational environments.
Effectively capturing the audio-visual context necessary for understanding
these social interactions first requires detecting and localizing the voice
activities of the device wearer and the surrounding people. These tasks are
challenging due to their egocentric nature: the wearer's head motion may cause
motion blur, surrounding people may appear in difficult viewing angles, and
there may be occlusions, visual clutter, audio noise, and bad lighting. Under
these conditions, previous state-of-the-art active speaker detection methods do
not give satisfactory results. Instead, we tackle the problem from a new
setting using both video and multi-channel microphone array audio. We propose a
novel end-to-end deep learning approach that is able to give robust voice
activity detection and localization results. In contrast to previous methods,
our method localizes active speakers from all possible directions on the
sphere, even outside the camera's field of view, while simultaneously detecting
the device wearer's own voice activity. Our experiments show that the proposed
method gives superior results, can run in real time, and is robust against
noise and clutter.
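To make the input/output interface described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' actual architecture; the layer sizes, module names, and spherical-heatmap parameterization are illustrative assumptions): a network that fuses multi-channel microphone-array spectrograms with an egocentric video frame and jointly predicts a voice-activity heatmap over the full sphere plus the wearer's own voice activity.

```python
# Minimal sketch (not the authors' architecture): fuse multi-channel
# microphone-array audio with egocentric video and predict
# (a) a spherical voice-activity heatmap and (b) the wearer's own voice activity.
# All layer sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn


class AudioVisualSpeakerLocalizer(nn.Module):
    def __init__(self, num_mics: int = 4, heatmap_size: tuple = (30, 60)):
        super().__init__()
        self.heatmap_size = heatmap_size  # elevation x azimuth bins on the sphere
        # Audio branch: treat per-channel spectrograms as image channels.
        self.audio_net = nn.Sequential(
            nn.Conv2d(num_mics, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Video branch: a small CNN over the RGB frame (a pretrained backbone
        # would normally be used here).
        self.video_net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        fused = 64 + 64
        # Head 1: voice-activity heatmap over the full sphere, which also
        # covers directions outside the camera's field of view.
        self.heatmap_head = nn.Linear(fused, heatmap_size[0] * heatmap_size[1])
        # Head 2: the device wearer's own voice activity.
        self.wearer_vad_head = nn.Linear(fused, 1)

    def forward(self, mic_spectrograms, video_frame):
        # mic_spectrograms: (B, num_mics, freq, time); video_frame: (B, 3, H, W)
        a = self.audio_net(mic_spectrograms)
        v = self.video_net(video_frame)
        f = torch.cat([a, v], dim=1)
        heatmap = self.heatmap_head(f).view(-1, *self.heatmap_size).sigmoid()
        wearer_vad = self.wearer_vad_head(f).sigmoid().squeeze(1)
        return heatmap, wearer_vad


if __name__ == "__main__":
    model = AudioVisualSpeakerLocalizer()
    spec = torch.randn(2, 4, 64, 100)    # batch of 4-channel spectrograms
    frame = torch.randn(2, 3, 224, 224)  # batch of RGB frames
    heatmap, wearer_vad = model(spec, frame)
    print(heatmap.shape, wearer_vad.shape)  # torch.Size([2, 30, 60]) torch.Size([2])
```

The two sigmoid heads mirror the paper's two outputs (spherical active-speaker localization and wearer voice-activity detection); a real system would use a stronger pretrained video backbone and temporal audio context.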
Related papers
- You Only Speak Once to See [24.889319740761827]
Grounding objects in images using visual cues is a well-established approach in computer vision.
We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes.
Experimental results indicate that audio guidance can be effectively applied to object grounding.
arXiv Detail & Related papers (2024-09-27T01:16:15Z)
- Egocentric Auditory Attention Localization in Conversations [25.736198724595486]
We propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention.
Our approach leverages features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset.
arXiv Detail & Related papers (2023-03-28T14:52:03Z)
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometric transformation between frames and exploiting it to update visual representations (a minimal sketch of this idea appears after this list).
It improves cross-modal localization robustness by disentangling the visually indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- Egocentric Audio-Visual Noise Suppression [11.113020254726292]
This paper studies audio-visual noise suppression for egocentric videos.
The video camera emulates the off-screen speaker's view of the outside world.
We first demonstrate that egocentric visual information is helpful for noise suppression.
arXiv Detail & Related papers (2022-11-07T15:53:12Z)
- No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration [8.710774926703321]
Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way.
We show that the selection of local features around pose keypoints has a positive effect on generalization performance.
We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
arXiv Detail & Related papers (2022-11-01T15:55:48Z)
- Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
arXiv Detail & Related papers (2022-05-30T04:52:58Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
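As referenced in the Egocentric Audio-Visual Object Localization entry above, here is a minimal sketch of egomotion-compensated temporal aggregation. It assumes the estimated geometric transform between frames is given as a 2D homography; the helper names and the simple averaging step are illustrative assumptions, not that paper's implementation.

```python
# Minimal sketch (an assumption, not the paper's implementation) of
# egomotion-compensated temporal aggregation: visual features from a previous
# frame are warped into the current frame's coordinates with an estimated
# 2D geometric transform (here a homography) before they are aggregated.
import torch
import torch.nn.functional as F


def warp_features(feat_prev: torch.Tensor, H_cur_to_prev: torch.Tensor) -> torch.Tensor:
    """Warp (B, C, H, W) features using a (B, 3, 3) homography that maps
    current-frame pixel coordinates into the previous frame."""
    B, C, Hh, Ww = feat_prev.shape
    # Pixel grid of the current frame in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(Hh, dtype=torch.float32),
        torch.arange(Ww, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3).unsqueeze(0)  # (1, H*W, 3)
    # Map current-frame coordinates into the previous frame.
    src = torch.matmul(grid, H_cur_to_prev.transpose(1, 2))                 # (B, H*W, 3)
    src = src[..., :2] / src[..., 2:].clamp(min=1e-6)
    # Normalize to [-1, 1] for grid_sample.
    src_x = 2.0 * src[..., 0] / (Ww - 1) - 1.0
    src_y = 2.0 * src[..., 1] / (Hh - 1) - 1.0
    sample_grid = torch.stack([src_x, src_y], dim=-1).reshape(B, Hh, Ww, 2)
    return F.grid_sample(feat_prev, sample_grid, align_corners=True)


def aggregate(feat_cur: torch.Tensor, feat_prev: torch.Tensor, H_cur_to_prev: torch.Tensor) -> torch.Tensor:
    """Average current features with egomotion-aligned previous features."""
    return 0.5 * (feat_cur + warp_features(feat_prev, H_cur_to_prev))


if __name__ == "__main__":
    feat_cur = torch.randn(1, 64, 32, 32)
    feat_prev = torch.randn(1, 64, 32, 32)
    H_identity = torch.eye(3).unsqueeze(0)  # no egomotion: warp is a no-op
    out = aggregate(feat_cur, feat_prev, H_identity)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

With an identity homography the warp is a no-op; with real egomotion, the previous frame's features are resampled into the current frame's coordinates before aggregation, so features belonging to the same scene point line up despite the wearer's head motion.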
This list is automatically generated from the titles and abstracts of the papers on this site.