Leveraging Visual Supervision for Array-based Active Speaker Detection
and Localization
- URL: http://arxiv.org/abs/2312.14021v1
- Date: Thu, 21 Dec 2023 16:53:04 GMT
- Title: Leveraging Visual Supervision for Array-based Active Speaker Detection
and Localization
- Authors: Davide Berghi and Philip J. B. Jackson
- Abstract summary: We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a ``student-teacher'' learning approach.
- Score: 3.836171323110284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional audio-visual approaches for active speaker detection (ASD)
typically rely on visually pre-extracted face tracks and the corresponding
single-channel audio to find the speaker in a video. Therefore, they tend to
fail every time the face of the speaker is not visible. We demonstrate that a
simple audio convolutional recurrent neural network (CRNN) trained with spatial
input features extracted from multichannel audio can perform simultaneous
horizontal active speaker detection and localization (ASDL), independently of
the visual modality. To address the time and cost of generating ground truth
labels to train such a system, we propose a new self-supervised training
pipeline that embraces a ``student-teacher'' learning approach. A conventional
pre-trained active speaker detector is adopted as a ``teacher'' network to
provide the position of the speakers as pseudo-labels. The multichannel audio
``student'' network is trained to generate the same results. At inference, the
student network can generalize and also locate occluded speakers that the
teacher network is unable to detect visually, yielding considerable
improvements in recall rate. Experiments on the TragicTalkers dataset show that
an audio network trained with the proposed self-supervised learning approach
can exceed the performance of the typical audio-visual methods and produce
results competitive with the costly conventional supervised training. We
demonstrate that improvements can be achieved when minimal manual supervision
is introduced in the learning pipeline. Further gains may be sought with larger
training sets and by integrating vision with the multichannel audio system.
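
The abstract above outlines the full pipeline: spatial features are computed from the multichannel audio, a CRNN student maps them to per-frame horizontal speaker activity, and a pre-trained audio-visual detector acts as the teacher whose detections become pseudo-labels. The PyTorch sketch below is only an illustration of that idea under assumptions not stated in the abstract: GCC-PHAT is used as the spatial feature, the horizontal field is discretised into azimuth bins, and the teacher supplies per-frame multi-hot activity vectors; the microphone count, network sizes, and loss are placeholders rather than the authors' configuration.

```python
# Hedged sketch of the "student-teacher" ASDL pipeline (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MICS = 4           # assumed microphone count
NUM_AZIMUTH_BINS = 90  # assumed horizontal resolution of the student's output
MAX_LAG = 16           # assumed GCC-PHAT lag window (samples)


def gcc_phat(sig, ref, n_fft=512, hop=160, max_lag=MAX_LAG):
    """GCC-PHAT between two microphone channels, framed with an STFT.
    Returns a (frames, 2*max_lag + 1) tensor of cross-correlation features."""
    window = torch.hann_window(n_fft)
    s1 = torch.stft(sig, n_fft, hop_length=hop, window=window, return_complex=True)
    s2 = torch.stft(ref, n_fft, hop_length=hop, window=window, return_complex=True)
    cross = s1 * s2.conj()
    cross = cross / (cross.abs() + 1e-8)            # phase transform
    cc = torch.fft.irfft(cross, n=n_fft, dim=0)     # back to the lag domain
    cc = torch.cat([cc[-max_lag:], cc[:max_lag + 1]], dim=0)  # centre the lags
    return cc.transpose(0, 1)


class AudioCRNNStudent(nn.Module):
    """Small CRNN mapping per-frame spatial features to azimuth activity."""

    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_AZIMUTH_BINS)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.gru(h)
        return self.head(h)                      # per-frame logits over azimuth bins


def train_step(student, optimizer, mics, teacher_pseudo_labels):
    """One self-supervised step: the visual teacher's detections, quantised into
    azimuth bins per frame, supervise the audio-only student.
    mics: (NUM_MICS, samples) waveform tensor.
    teacher_pseudo_labels: (frames, NUM_AZIMUTH_BINS) multi-hot activity."""
    feats = []
    for i in range(NUM_MICS):                    # GCC-PHAT for every mic pair
        for j in range(i + 1, NUM_MICS):
            feats.append(gcc_phat(mics[i], mics[j]))
    x = torch.cat(feats, dim=-1).unsqueeze(0)    # (1, frames, feat_dim)

    logits = student(x)                          # (1, frames, NUM_AZIMUTH_BINS)
    loss = F.binary_cross_entropy_with_logits(logits.squeeze(0), teacher_pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example wiring with hypothetical shapes:
#   student = AudioCRNNStudent(feat_dim=6 * (2 * MAX_LAG + 1))  # 6 mic pairs
#   optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
#   loss = train_step(student, optimizer, mics, pseudo_labels)
```

In such a setup, frames where the teacher detects no visible face would simply contribute no supervision, which is what allows the trained audio-only student to later recall speakers the visual teacher misses.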
Related papers
- Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning [2.3076690318595676]
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
A Federated Learning model can identify the participants in a conversation without the requirement of a large audio database for training.
An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings.
arXiv Detail & Related papers (2024-04-16T18:40:28Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Late Audio-Visual Fusion for In-The-Wild Speaker Diarization [33.0046568984949]
We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion.
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers.
arXiv Detail & Related papers (2022-11-02T17:20:42Z) - Improved Relation Networks for End-to-End Speaker Verification and
Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z) - Streaming Multi-talker Speech Recognition with Joint Speaker
Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from LibriSpeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels in the former case enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z) - Augmentation adversarial training for self-supervised speaker
recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z) - Multi-task Learning for Speaker Verification and Voice Trigger Detection [18.51531434428444]
We investigate training a single network to perform both tasks jointly.
We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data.
Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations.
arXiv Detail & Related papers (2020-01-26T21:19:27Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)