Audio-Visual Activity Guided Cross-Modal Identity Association for Active
Speaker Detection
- URL: http://arxiv.org/abs/2212.00539v1
- Date: Thu, 1 Dec 2022 14:46:00 GMT
- Title: Audio-Visual Activity Guided Cross-Modal Identity Association for Active
Speaker Detection
- Authors: Rahul Sharma and Shrikanth Narayanan
- Abstract summary: Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
- Score: 37.28070242751129
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Active speaker detection in videos addresses associating a source face,
visible in the video frames, with the underlying speech in the audio modality.
The two primary sources of information to derive such a speech-face
relationship are i) visual activity and its interaction with the speech signal
and ii) co-occurrences of speakers' identities across modalities in the form of
face and speech. The two approaches have their limitations: the audio-visual
activity models get confused with other frequently occurring vocal activities,
such as laughing and chewing, while the speakers' identity-based methods are
limited to videos having enough disambiguating information to establish a
speech-face association. Since the two approaches are independent, we
investigate their complementary nature in this work. We propose a novel
unsupervised framework to guide the speakers' cross-modal identity association
with the audio-visual activity for active speaker detection. Through
experiments on entertainment media videos from two benchmark datasets, the AVA
active speaker (movies) and Visual Person Clustering Dataset (TV shows), we
show that a simple late fusion of the two approaches enhances the active
speaker detection performance.
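
The abstract reports that a simple late fusion of the two score streams improves detection, but does not spell out the fusion rule. Below is a minimal sketch of one plausible instantiation, assuming each approach yields frame-level scores in [0, 1] for a face track; the weighted-average rule, the weight, and the threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_asd_scores(activity_scores, identity_scores, weight=0.5, threshold=0.5):
    """Late-fuse frame-level active speaker scores for one face track.

    activity_scores: scores in [0, 1] from an audio-visual activity model
        (e.g., speech/lip-motion synchrony cues).
    identity_scores: scores in [0, 1] from a cross-modal identity-association
        model (e.g., speech-face identity co-occurrence).
    weight / threshold: illustrative defaults, not values from the paper.
    Returns a boolean mask of frames predicted as "active speaker".
    """
    activity_scores = np.asarray(activity_scores, dtype=float)
    identity_scores = np.asarray(identity_scores, dtype=float)
    fused = weight * activity_scores + (1.0 - weight) * identity_scores
    return fused >= threshold

# Toy usage with hypothetical scores for a four-frame face track.
activity = [0.9, 0.8, 0.2, 0.1]   # activity model: confident on frames 0-1
identity = [0.7, 0.9, 0.6, 0.1]   # identity model: also weakly fires on frame 2
print(fuse_asd_scores(activity, identity))   # [ True  True False False]
```

In this toy case the identity cue alone is ambiguous on frame 2 and the activity cue suppresses it, mirroring the complementarity argument made in the abstract.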
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Unsupervised active speaker detection in media content using cross-modal information [37.28070242751129]
We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies.
We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task.
We show performance competitive with state-of-the-art fully supervised methods.
arXiv Detail & Related papers (2022-09-24T00:51:38Z)
- Look Who's Talking: Active Speaker Detection in the Wild [30.22352874520012]
We present a novel audio-visual dataset for active speaker detection in the wild.
Active Speakers in the Wild (ASW) dataset contains videos and co-occurring speech segments with dense speech activity labels.
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.
arXiv Detail & Related papers (2021-08-17T14:16:56Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Cross modal video representations for weakly supervised active speaker localization [39.67239953795999]
A cross-modal neural network for learning visual representations is presented.
We present a weakly supervised system for the task of localizing active speakers in movie content.
We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework.
arXiv Detail & Related papers (2020-03-09T18:50:50Z)