Egocentric Auditory Attention Localization in Conversations
- URL: http://arxiv.org/abs/2303.16024v1
- Date: Tue, 28 Mar 2023 14:52:03 GMT
- Title: Egocentric Auditory Attention Localization in Conversations
- Authors: Fiona Ryan, Hao Jiang, Abhinav Shukla, James M. Rehg, Vamsi Krishna Ithapu
- Abstract summary: We propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention.
Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset.
- Score: 25.736198724595486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a noisy conversation environment such as a dinner party, people often
exhibit selective auditory attention, or the ability to focus on a particular
speaker while tuning out others. Recognizing who somebody is listening to in a
conversation is essential for developing technologies that can understand
social behavior and devices that can augment human hearing by amplifying
particular sound sources. The computer vision and audio research communities
have made great strides towards recognizing sound sources and speakers in
scenes. In this work, we take a step further by focusing on the problem of
localizing auditory attention targets in egocentric video, or detecting who in
a camera wearer's field of view they are listening to. To tackle the new and
challenging Selective Auditory Attention Localization problem, we propose an
end-to-end deep learning approach that uses egocentric video and multichannel
audio to predict the heatmap of the camera wearer's auditory attention. Our
approach leverages spatiotemporal audiovisual features and holistic reasoning
about the scene to make predictions, and outperforms a set of baselines on a
challenging multi-speaker conversation dataset. Project page:
https://fkryan.github.io/saal
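As a rough illustration of the kind of model the abstract describes, the sketch below fuses an egocentric video clip with multichannel audio and decodes a spatial heatmap of auditory attention. It is a minimal PyTorch sketch under assumed shapes and layer choices (the encoders, feature dimension, and microphone count are placeholders), not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class AudioVisualAttentionHeatmap(nn.Module):
    """Illustrative two-stream model: egocentric video clip + multichannel
    audio spectrograms -> spatial heatmap of auditory attention.
    Modules and shapes are placeholders, not the paper's architecture."""

    def __init__(self, n_mics: int = 4, feat_dim: int = 256):
        super().__init__()
        # Video branch: 3D convolutions over (C, T, H, W) frames.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(),
        )
        # Audio branch: 2D convolutions over per-microphone log-mel spectrograms.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(n_mics, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B, feat_dim, 1, 1)
        )
        # Fused decoder upsamples the joint features to a one-channel heatmap.
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
        )

    def forward(self, frames: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, T, H, W); audio: (B, n_mics, mel_bins, time)
        v = self.video_encoder(frames).mean(dim=2)        # pool time -> (B, D, H', W')
        a = self.audio_encoder(audio)                     # (B, D, 1, 1)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])      # broadcast over space
        fused = torch.cat([v, a], dim=1)
        return self.decoder(fused)                        # (B, 1, H'', W'') heatmap logits


# Usage sketch with dummy tensors.
model = AudioVisualAttentionHeatmap()
frames = torch.randn(2, 3, 8, 128, 128)   # batch of 8-frame egocentric clips
audio = torch.randn(2, 4, 64, 100)        # 4-channel log-mel spectrograms
heatmap = model(frames, audio)
print(heatmap.shape)                      # torch.Size([2, 1, 128, 128])
```

In a setup like this, a ground-truth heatmap centered on the attended speaker could supervise the output with a pixel-wise loss such as `nn.BCEWithLogitsLoss`; the paper's actual training objective may differ.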
Related papers
- Audio-Visual Talker Localization in Video for Spatial Sound Reproduction [3.2472293599354596]
In this research, we detect and locate the active speaker in the video.
We find that the two modalities complement each other.
Future investigations will assess the robustness of the model in noisy and highly reverberant environments.
arXiv Detail & Related papers (2024-06-01T16:47:07Z)
- The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective [36.09288501153965]
We introduce the Ego-Exocentric Conversational Graph Prediction problem.
We propose a unified multi-modal framework -- Audio-Visual Conversational Attention (AV-CONV)
Specifically, we adopt the self-attention mechanism to model representations across time, across subjects, and across modalities (see the illustrative sketch after this list).
arXiv Detail & Related papers (2023-12-20T09:34:22Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually indicated audio representations.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection [9.914246432182873]
We show that an end-to-end model performs at least as well as a considerably larger two-step system under various noise conditions.
In experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task.
arXiv Detail & Related papers (2022-05-11T15:55:31Z)
- Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization [13.144367063836597]
We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results.
Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
arXiv Detail & Related papers (2022-01-06T05:40:16Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as synthesizing a listener's non-verbal head motions and expressions in reaction to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities into this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- The Right to Talk: An Audio-Visual Transformer Approach [27.71444773878775]
This work introduces a new Audio-Visual Transformer approach to localizing and highlighting the main speaker in both the audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies that is able to automatically localize and highlight the main speaker in both visual and audio channels in multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio)
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
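Several of the entries above (notably the AV-CONV item) mention self-attention that mixes audio and visual tokens across time and subjects. The block below is a generic, minimal PyTorch sketch of such cross-modal self-attention over concatenated token sequences; all names, shapes, and the token layout are illustrative assumptions, not the implementation of any paper listed here.

```python
import torch
import torch.nn as nn


class CrossModalSelfAttention(nn.Module):
    """Illustrative block: audio and visual tokens are concatenated and
    attended over jointly, with a residual connection and layer norm."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens, visual_tokens: (B, N, dim), where N indexes
        # time steps and/or subjects in the scene.
        tokens = torch.cat([audio_tokens, visual_tokens], dim=1)  # (B, 2N, dim)
        attended, _ = self.attn(tokens, tokens, tokens)           # joint attention
        return self.norm(tokens + attended)                       # residual + norm


# Usage with dummy tokens: 2 clips, 10 tokens per modality, 256-dim features.
block = CrossModalSelfAttention()
audio = torch.randn(2, 10, 256)
visual = torch.randn(2, 10, 256)
fused = block(audio, visual)
print(fused.shape)  # torch.Size([2, 20, 256])
```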