Unsupervised active speaker detection in media content using cross-modal
information
- URL: http://arxiv.org/abs/2209.11896v1
- Date: Sat, 24 Sep 2022 00:51:38 GMT
- Title: Unsupervised active speaker detection in media content using cross-modal
information
- Authors: Rahul Sharma and Shrikanth Narayanan
- Abstract summary: We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies.
We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task.
We show competitive performance to state-of-the-art fully supervised methods.
- Score: 37.28070242751129
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a cross-modal unsupervised framework for active speaker detection
in media content such as TV shows and movies. Machine learning advances have
enabled impressive performance in identifying individuals from speech and
facial images. We leverage speaker identity information from speech and faces,
and formulate active speaker detection as a speech-face assignment task such
that the active speaker's face and the underlying speech identify the same
person (character). We express the speech segments in terms of their associated
speaker identity distances, from all other speech segments, to capture a
relative identity structure for the video. Then we assign an active speaker's
face to each speech segment from the concurrently appearing faces such that the
obtained set of active speaker faces displays a similar relative identity
structure. Furthermore, we propose a simple and effective approach to address
speech segments where speakers are present off-screen. We evaluate the proposed
system on three benchmark datasets -- Visual Person Clustering dataset,
AVA-active speaker dataset, and Columbia dataset -- consisting of videos from
entertainment and broadcast media, and show competitive performance to
state-of-the-art fully supervised methods.
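As a rough illustration of the speech-face assignment idea described above, the following is a minimal sketch. It assumes precomputed speaker embeddings for the speech segments and face embeddings for the faces that co-occur with each segment; the function names, the cosine-distance choice, and the greedy refinement loop are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def pairwise_distances(embeddings: np.ndarray) -> np.ndarray:
    # Cosine-distance matrix: captures the relative identity structure.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def assign_active_faces(speech_emb, candidate_faces):
    """Pick one face per speech segment so that the face-face distance
    structure mirrors the speech-speech distance structure.

    speech_emb      : (N, D) speaker embeddings, one per speech segment.
    candidate_faces : list of N arrays, each (K_i, D) -- embeddings of the
                      faces that appear on screen during segment i.
    Returns the chosen face index for each segment (illustrative greedy
    coordinate-descent, not the paper's exact optimization).
    """
    speech_dist = pairwise_distances(speech_emb)
    n = len(candidate_faces)
    chosen = [0] * n                      # start from the first candidate
    for _ in range(3):                    # a few greedy refinement passes
        for i in range(n):
            best_k, best_err = chosen[i], np.inf
            for k in range(len(candidate_faces[i])):
                faces = np.stack([
                    candidate_faces[i][k] if j == i
                    else candidate_faces[j][chosen[j]]
                    for j in range(n)
                ])
                face_dist = pairwise_distances(faces)
                # How far segment i's face-identity distances deviate from
                # its speech-identity distances.
                err = np.abs(face_dist[i] - speech_dist[i]).sum()
                if err < best_err:
                    best_k, best_err = k, err
            chosen[i] = best_k
    return chosen

# Toy usage with random embeddings: 4 speech segments, 2-3 candidate faces each.
rng = np.random.default_rng(0)
speech = rng.normal(size=(4, 128))
faces = [rng.normal(size=(k, 128)) for k in (2, 3, 2, 3)]
print(assign_active_faces(speech, faces))
```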
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos addresses the task of associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- Using Active Speaker Faces for Diarization in TV shows [37.28070242751129]
We perform face clustering on the active speaker faces and show superior speaker diarization performance compared to the state-of-the-art audio-based diarization methods.
We also observe that a moderately well-performing active speaker system could outperform the audio-based diarization systems.
arXiv Detail & Related papers (2022-03-30T00:37:19Z)
- Look Who's Talking: Active Speaker Detection in the Wild [30.22352874520012]
We present a novel audio-visual dataset for active speaker detection in the wild.
The Active Speakers in the Wild (ASW) dataset contains videos and co-occurring speech segments with dense speech activity labels.
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.
arXiv Detail & Related papers (2021-08-17T14:16:56Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- FaceFilter: Audio-visual speech separation using still images [41.97445146257419]
This paper aims to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker.
arXiv Detail & Related papers (2020-05-14T15:42:31Z)
- Cross modal video representations for weakly supervised active speaker localization [39.67239953795999]
A cross-modal neural network for learning visual representations is presented.
We present a weakly supervised system for the task of localizing active speakers in movie content.
We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework.
arXiv Detail & Related papers (2020-03-09T18:50:50Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)