Best of Both Worlds: Multi-task Audio-Visual Automatic Speech
Recognition and Active Speaker Detection
- URL: http://arxiv.org/abs/2205.05206v1
- Date: Tue, 10 May 2022 23:03:19 GMT
- Title: Best of Both Worlds: Multi-task Audio-Visual Automatic Speech
Recognition and Active Speaker Detection
- Authors: Otavio Braga, Olivier Siohan
- Abstract summary: In noisy conditions, automatic speech recognition can benefit from the addition of visual signals coming from a video of the speaker's face.
Active speaker detection involves selecting at each moment in time which of the visible faces corresponds to the audio.
Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of some accuracy on active speaker detection.
This work closes that gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss.
- Score: 9.914246432182873
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Under noisy conditions, automatic speech recognition (ASR) can greatly
benefit from the addition of visual signals coming from a video of the
speaker's face. However, when multiple candidate speakers are visible this
traditionally requires solving a separate problem, namely active speaker
detection (ASD), which entails selecting at each moment in time which of the
visible faces corresponds to the audio. Recent work has shown that we can solve
both problems simultaneously by employing an attention mechanism over the
competing video tracks of the speakers' faces, at the cost of sacrificing some
accuracy on active speaker detection. This work closes this gap in active
speaker detection accuracy by presenting a single model that can be jointly
trained with a multi-task loss. By combining the two tasks during training we
reduce the ASD classification error rate by approximately 25%, while
simultaneously improving the ASR performance when compared to the multi-person
baseline trained exclusively for ASR.
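
To make the mechanism concrete, here is a minimal sketch of the idea described above: per-frame attention of the audio over the competing face tracks acts as a soft active speaker decision, and training combines an ASR loss with an ASD loss on the attention weights. This is an illustration under stated assumptions, not the authors' implementation; the feature dimensions, the CTC formulation of the ASR term, and the weight `asd_weight` are all assumed.

```python
# Sketch only: per-frame attention of the audio query over N candidate face
# tracks doubles as a soft active-speaker decision; a multi-task loss adds
# explicit ASD supervision on the attention weights to the ASR (CTC) loss.
# Dimensions, encoder choices, and the loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAVASR(nn.Module):
    def __init__(self, audio_dim=240, video_dim=512, hidden=512, vocab=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.encoder = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)

    def forward(self, audio, face_tracks):
        # audio:       [B, T, audio_dim]    acoustic features
        # face_tracks: [B, N, T, video_dim] visual features of N candidate faces
        q = self.audio_proj(audio)                                 # [B, T, H]
        k = self.video_proj(face_tracks)                           # [B, N, T, H]
        scores = torch.einsum('bth,bnth->btn', q, k) / q.shape[-1] ** 0.5
        attn = scores.softmax(dim=-1)                              # soft ASD per frame
        v = torch.einsum('btn,bnth->bth', attn, k)                 # attended visual features
        enc, _ = self.encoder(torch.cat([q, v], dim=-1))
        return self.asr_head(enc).log_softmax(-1), attn

def multitask_loss(log_probs, attn, targets, in_lens, tgt_lens, spk_idx, asd_weight=0.1):
    # ASR term: CTC over the recognizer outputs (an assumed criterion).
    asr = F.ctc_loss(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)
    # ASD term: cross-entropy pulling each frame's attention onto the true speaker's track.
    asd = F.nll_loss(attn.clamp_min(1e-8).log().flatten(0, 1), spk_idx.flatten())
    return asr + asd_weight * asd
```

At inference, the per-frame argmax of the attention weights gives the ASD decision, while the ASR head is decoded as usual.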
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is somewhat unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z) - Leveraging Visual Supervision for Array-based Active Speaker Detection
and Localization [3.836171323110284]
We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach.
arXiv Detail & Related papers (2023-12-21T16:53:04Z) - Getting More for Less: Using Weak Labels and AV-Mixup for Robust Audio-Visual Speaker Verification [0.4681661603096334]
We show that an auxiliary task with even weak labels can increase the quality of the learned speaker representation.
We also extend the Generalized End-to-End Loss (GE2E) to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space (a rough sketch of this extension appears after the list of related papers).
Our network achieves state-of-the-art speaker verification performance, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the VoxCeleb1-O/E/H test sets.
arXiv Detail & Related papers (2023-09-13T17:45:41Z) - In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
The second problem is that embedding extractors are typically not trained on audio containing overlapped speech or speaker changes; we propose two data augmentation techniques to alleviate it, making the extractors aware of such input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z) - A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active
Speaker Selection [9.914246432182873]
We show that an end-to-end model performs at least as well as a considerably larger two-step system under various noise conditions.
In experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task.
arXiv Detail & Related papers (2022-05-11T15:55:31Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Look\&Listen: Multi-Modal Correlation Learning for Active Speaker
Detection and Speech Enhancement [18.488808141923492]
ADENet is proposed to achieve target speaker detection and speech enhancement through joint audio-visual learning.
The cross-modal relationship between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning.
arXiv Detail & Related papers (2022-03-04T09:53:19Z) - Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art non-streaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z) - Speaker-Utterance Dual Attention for Speaker and Utterance Verification [77.2346078109261]
We implement the idea of speaker-utterance dual attention (SUDA) in a unified neural network.
The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams.
arXiv Detail & Related papers (2020-08-20T11:37:57Z) - Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z) - Target-Speaker Voice Activity Detection: a Novel Approach for
Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame (a bare-bones sketch appears after this list).
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)