Cross modal video representations for weakly supervised active speaker
localization
- URL: http://arxiv.org/abs/2003.04358v2
- Date: Wed, 3 Nov 2021 22:30:39 GMT
- Title: Cross modal video representations for weakly supervised active speaker
localization
- Authors: Rahul Sharma, Krishna Somandepalli and Shrikanth Narayanan
- Abstract summary: A cross-modal neural network for learning visual representations is presented.
We present a weakly supervised system for the task of localizing active speakers in movie content.
We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework.
- Score: 39.67239953795999
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, when, how, and where, and when they are not. Speaker activity can be automatically discerned
from the rich multimodal information present in the media content. This is
however a challenging problem due to the vast variety and contextual
variability in the media content, and the lack of labeled data. In this work,
we present a cross-modal neural network for learning visual representations,
which have implicit information pertaining to the spatial location of a speaker
in the visual frames. To avoid the need for manual annotations of active speakers in visual frames, which are very expensive to acquire, we present a weakly supervised system for localizing active speakers in movie content. We use the learned cross-modal visual representations and provide weak supervision from movie subtitles, which act as a proxy for voice activity, thus requiring no manual annotations. We evaluate the performance of the
proposed system on the AVA active speaker dataset and demonstrate the
effectiveness of the cross-modal embeddings for localizing active speakers in
comparison to fully supervised systems. We also demonstrate state-of-the-art
performance for the task of voice activity detection in an audio-visual
framework, especially when speech is accompanied by noise and music.
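Below is a minimal sketch of how the subtitle-driven weak supervision described in the abstract could be wired up. It is an assumed reconstruction rather than the authors' released code: the `CrossModalVAD` module, its layer sizes, and the `subtitle_voice_activity` helper are illustrative stand-ins, and the returned visual feature map stands in for the activation map that a CAM-style step would turn into a speaker localization heatmap.

```python
# Illustrative sketch only (assumed architecture, not the paper's released code):
# a two-stream network that predicts voice activity from short movie clips,
# supervised solely by subtitle timing, so that the visual stream's feature map
# can later be inspected (CAM-style) to localize the active speaker.
import torch
import torch.nn as nn

def subtitle_voice_activity(subtitle_spans, clip_start, clip_end):
    """Weak label: 1.0 if the clip overlaps any subtitle interval (proxy for speech)."""
    return float(any(s < clip_end and e > clip_start for s, e in subtitle_spans))

class CrossModalVAD(nn.Module):
    """Fuses a visual clip and its audio spectrogram to predict voice activity."""
    def __init__(self):
        super().__init__()
        # Stand-in visual backbone; the actual system would use a stronger 3D CNN.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Stand-in audio backbone operating on log-mel spectrograms.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
        )
        self.classifier = nn.Linear(64 + 64, 1)

    def forward(self, frames, spectrogram):
        fmap = self.visual(frames)            # (B, 64, T, H, W): spatial evidence map
        v = fmap.mean(dim=(2, 3, 4))          # pooled visual descriptor
        a = self.audio(spectrogram)           # pooled audio descriptor
        logit = self.classifier(torch.cat([v, a], dim=1)).squeeze(1)
        return logit, fmap                    # fmap is reused for speaker localization

# One weakly supervised training step with subtitle-derived labels.
model = CrossModalVAD()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

frames = torch.randn(2, 3, 8, 64, 64)         # (batch, RGB, time, height, width)
spectrogram = torch.randn(2, 1, 80, 100)      # (batch, 1, mel bins, time)
labels = torch.tensor([subtitle_voice_activity([(1.0, 2.5)], 0.0, 2.0),
                       subtitle_voice_activity([(1.0, 2.5)], 3.0, 5.0)])

logit, fmap = model(frames, spectrogram)
optimizer.zero_grad()
loss = criterion(logit, labels)
loss.backward()
optimizer.step()
```

The key design point is that the only supervision is clip-level voice activity inferred from subtitle timing; no bounding boxes of active speakers are ever used.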
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos addresses the problem of associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- Unsupervised active speaker detection in media content using cross-modal information [37.28070242751129]
We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies.
We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task (a toy sketch of this assignment step appears after this list).
We show competitive performance to state-of-the-art fully supervised methods.
arXiv Detail & Related papers (2022-09-24T00:51:38Z)
- Audio-video fusion strategies for active speaker detection in meetings [5.61861182374067]
We propose two types of fusion for the detection of the active speaker, combining two visual modalities and an audio modality through neural networks.
For our application context, adding motion information greatly improves performance.
We have shown that attention-based fusion improves performance while reducing the standard deviation.
arXiv Detail & Related papers (2022-06-09T08:20:52Z)
- Using Active Speaker Faces for Diarization in TV shows [37.28070242751129]
We perform face clustering on the active speaker faces and show superior speaker diarization performance compared to the state-of-the-art audio-based diarization methods.
We also observe that a moderately well-performing active speaker system could outperform the audio-based diarization systems.
arXiv Detail & Related papers (2022-03-30T00:37:19Z)
- Look Who's Talking: Active Speaker Detection in the Wild [30.22352874520012]
We present a novel audio-visual dataset for active speaker detection in the wild.
Active Speakers in the Wild (ASW) dataset contains videos and co-occurring speech segments with dense speech activity labels.
Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way.
arXiv Detail & Related papers (2021-08-17T14:16:56Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
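As referenced in the unsupervised active speaker detection entry above, the speech-face assignment formulation can be illustrated with a toy sketch. This is an assumed interpretation, not that paper's code: `assign_speech_to_faces`, the embedding dimensions, and the Hungarian matching step are hypothetical stand-ins for whatever identity models and assignment procedure the paper actually uses.

```python
# Toy speech-face assignment (assumed interpretation of the formulation, not the
# cited paper's code): match speech segments to face tracks by identity similarity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_speech_to_faces(speech_emb, face_emb):
    """speech_emb: (S, D) speaker-ID embeddings; face_emb: (F, D) face-ID embeddings.
    Returns (speech_idx, face_idx) pairs that maximize total cosine similarity."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    similarity = s @ f.T                              # (S, F) cosine similarity matrix
    rows, cols = linear_sum_assignment(-similarity)   # Hungarian step: maximize similarity
    return list(zip(rows.tolist(), cols.tolist()))

# Example: three speech segments, four candidate face tracks, 16-dim embeddings.
rng = np.random.default_rng(0)
pairs = assign_speech_to_faces(rng.normal(size=(3, 16)), rng.normal(size=(4, 16)))
print(pairs)   # one (speech segment, face track) index pair per speech segment
```

Here each speech segment is matched to at most one face track by maximizing cosine similarity between speaker-identity and face-identity embeddings.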