Late Audio-Visual Fusion for In-The-Wild Speaker Diarization
- URL: http://arxiv.org/abs/2211.01299v2
- Date: Wed, 27 Sep 2023 12:47:35 GMT
- Title: Late Audio-Visual Fusion for In-The-Wild Speaker Diarization
- Authors: Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux
- Abstract summary: We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion.
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers.
- Score: 33.0046568984949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker diarization is well studied for constrained audios but little
explored for challenging in-the-wild videos, which have more speakers, shorter
utterances, and inconsistent on-screen speakers. We address this gap by
proposing an audio-visual diarization model which combines audio-only and
visual-centric sub-systems via late fusion. For audio, we show that an
attractor-based end-to-end system (EEND-EDA) performs remarkably well when
trained with our proposed recipe of a simulated proxy dataset, and propose an
improved version, EEND-EDA++, that uses attention in decoding and a speaker
recognition loss during training to better handle the larger number of
speakers. The visual-centric sub-system leverages facial attributes and
lip-audio synchrony for identity and speech activity estimation of on-screen
speakers. Both sub-systems surpass the state of the art (SOTA) by a large
margin, with the fused audio-visual system achieving a new SOTA on the AVA-AVD
benchmark.
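The core of the proposed system is the late-fusion step that reconciles the audio-only and visual-centric outputs. As a rough illustration only, the sketch below matches speakers across the two sub-systems by frame overlap (Hungarian assignment) and merges their activities; the function name, threshold, and matching criterion are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def late_fuse(audio_act, visual_act, overlap_thresh=0.3):
    """Combine frame-level speaker activities from two diarization sub-systems.

    audio_act:  (T, Na) 0/1 activity matrix from the audio-only system.
    visual_act: (T, Nv) 0/1 activity matrix for on-screen speakers.
    Returns a (T, N) fused 0/1 activity matrix.
    """
    Na, Nv = audio_act.shape[1], visual_act.shape[1]

    # Hungarian matching on frame overlap between audio and visual speakers.
    overlap = audio_act.T @ visual_act              # (Na, Nv)
    a_idx, v_idx = linear_sum_assignment(-overlap)

    fused, used_v = [], set()
    for a, v in zip(a_idx, v_idx):
        ratio = overlap[a, v] / max(audio_act[:, a].sum(), 1)
        if ratio >= overlap_thresh:                 # treat as the same speaker
            fused.append(np.maximum(audio_act[:, a], visual_act[:, v]))
            used_v.add(v)
        else:                                       # weak match: keep audio only
            fused.append(audio_act[:, a])
    for a in set(range(Na)) - set(a_idx):           # audio-only speakers
        fused.append(audio_act[:, a])
    for v in set(range(Nv)) - used_v:               # on-screen-only speakers
        fused.append(visual_act[:, v])
    return np.stack(fused, axis=1)
```

Unmatched audio-only speakers (off-screen voices) and unmatched on-screen speakers are carried through unchanged, which is the property that lets a fused system cover both cases.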
Related papers
- RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues [45.095482324156606]
We propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers.
Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers.
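As a loose illustration of separating several speakers in one pass with possibly missing visual cues, here is a minimal PyTorch sketch; every module, dimension, and the mask-based formulation are assumptions, not the RAVSS architecture.

```python
import torch
import torch.nn as nn

class ConcurrentSeparator(nn.Module):
    """Toy sketch: separate all speakers in one pass by predicting one
    mask per speaker, conditioned on (possibly missing) visual embeddings."""

    def __init__(self, feat_dim=256, visual_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, feat_dim)
        self.mix_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.mask_head = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, mix_feats, visual_embs, visual_present):
        # mix_feats: (B, T, F) mixture features
        # visual_embs: (B, S, Dv) one embedding per target speaker
        # visual_present: (B, S) 1 if the speaker's face is visible, else 0
        h, _ = self.mix_enc(mix_feats)                       # (B, T, F)
        v = self.visual_proj(visual_embs)                    # (B, S, F)
        v = v * visual_present.unsqueeze(-1)                 # zero missing cues
        B, T, F = h.shape
        S = v.shape[1]
        h_exp = h.unsqueeze(1).expand(B, S, T, F)
        v_exp = v.unsqueeze(2).expand(B, S, T, F)
        masks = torch.sigmoid(self.mask_head(torch.cat([h_exp, v_exp], -1)))
        return masks * mix_feats.unsqueeze(1)                # (B, S, T, F)
```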
arXiv Detail & Related papers (2024-07-27T09:56:23Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
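A bidirectional bridge can be pictured as two cross-attention paths, one per direction, so the often weaker audio stream is reinforced by vision and vice versa; the toy PyTorch block below only sketches that idea (layer sizes and residual connections are assumptions, not BAVD's actual design).

```python
import torch
import torch.nn as nn

class BidirectionalBridge(nn.Module):
    """Toy bidirectional audio-visual bridge: each modality attends to the other."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio: (B, Ta, D), visual: (B, Tv, D)
        audio_enh, _ = self.v2a(query=audio, key=visual, value=visual)
        visual_enh, _ = self.a2v(query=visual, key=audio, value=audio)
        return audio + audio_enh, visual + visual_enh
```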
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization [3.836171323110284]
We show that a simple audio convolutional recurrent neural network can perform simultaneous horizontal active speaker detection and localization.
We propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach.
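The student-teacher idea can be sketched as a visual model producing frame-level active-speaker pseudo-labels that supervise an audio-only student; in the hypothetical training step below, `audio_student` and `visual_teacher` are placeholder callables and the loss choice is an assumption, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_step(audio_student, visual_teacher, audio_feats, video_frames,
                      optimizer):
    """One hypothetical student-teacher step: the visual model supervises
    the audio-only detector with frame-level pseudo-labels."""
    with torch.no_grad():
        pseudo = torch.sigmoid(visual_teacher(video_frames))   # (B, T, 1) activity
    pred = audio_student(audio_feats)                          # (B, T, 1) logits
    loss = F.binary_cross_entropy_with_logits(pred, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```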
arXiv Detail & Related papers (2023-12-21T16:53:04Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
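One way to picture an audio-guided cross-modal fusion encoder is to let acoustic frames query the lip features through stacked cross-attention layers; the PyTorch sketch below illustrates that pattern only and does not reproduce the paper's CMFE.

```python
import torch
import torch.nn as nn

class AudioGuidedFusion(nn.Module):
    """Toy audio-guided fusion: acoustic frames act as queries over lip features."""

    def __init__(self, dim=256, heads=4, layers=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))

    def forward(self, audio, lips):
        # audio: (B, Ta, D) acoustic features, lips: (B, Tv, D) visual features
        x = audio
        for attn, norm in zip(self.blocks, self.norms):
            fused, _ = attn(query=x, key=lips, value=lips)
            x = norm(x + fused)     # residual keeps the audio stream dominant
        return x
```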
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
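The two-stage structure can be sketched as a transformer that maps noisy audio plus lip features to a clean mel-spectrogram, which a separate neural vocoder then converts to a waveform; all dimensions and the input pairing below are assumptions, not LA-VocE's actual configuration.

```python
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    """Stage 1 sketch: noisy audio + lip features in, clean mel-spectrogram out."""

    def __init__(self, dim=256, n_mels=80, layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.audio_in = nn.Linear(n_mels, dim)    # noisy mel frames
        self.video_in = nn.Linear(512, dim)       # lip embeddings (assumed size)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, noisy_mel, lip_embs):
        # noisy_mel: (B, T, 80), lip_embs: (B, T, 512), assumed time-aligned
        x = self.audio_in(noisy_mel) + self.video_in(lip_embs)
        return self.out(self.encoder(x))          # (B, T, 80) predicted clean mel

# Stage 2 (not shown): a neural vocoder, e.g. a HiFi-GAN-style generator,
# converts the predicted mel-spectrogram back to a waveform.
```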
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
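Token-level serialization can be illustrated by interleaving words from overlapping speakers in time and marking every speaker change; the snippet below is a toy version in that spirit, with a hypothetical `<sc>` marker rather than t-SOT's actual token set or virtual-channel scheme.

```python
CHANGE_TOKEN = "<sc>"   # hypothetical speaker-change marker

def serialize(utterances):
    """Toy token-level serialization of overlapping speech.

    utterances: list of (speaker_id, [(time, word), ...])
    Returns the single token stream and the speaker attached to each token.
    """
    timed = [(t, spk, w) for spk, words in utterances for t, w in words]
    timed.sort(key=lambda x: x[0])

    tokens, speakers, prev = [], [], None
    for _, spk, word in timed:
        if prev is not None and spk != prev:
            tokens.append(CHANGE_TOKEN)
            speakers.append(spk)   # the change token belongs to the new speaker
        tokens.append(word)
        speakers.append(spk)
        prev = spk
    return tokens, speakers

# Example: two partially overlapping utterances.
stream, spk = serialize([("A", [(0.0, "hello"), (0.6, "there")]),
                         ("B", [(0.4, "hi")])])
# stream -> ['hello', '<sc>', 'hi', '<sc>', 'there']
# spk    -> ['A', 'B', 'B', 'A', 'A']
```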
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- AVA-AVD: Audio-visual Speaker Diarization in the Wild [26.97787596025907]
Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios.
We propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility.
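The modality-mask idea can be sketched as zeroing the visual branch whenever a face is off-screen so the relation score falls back to audio; the PyTorch block below is only an illustration, with embedding sizes and the scoring head assumed rather than taken from AVR-Net.

```python
import torch
import torch.nn as nn

class MaskedAVRelation(nn.Module):
    """Toy visibility-aware fusion: mask the visual branch when the face is hidden."""

    def __init__(self, audio_dim=192, face_dim=512, dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.face_proj = nn.Linear(face_dim, dim)
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def forward(self, audio_emb, face_emb, face_visible):
        # audio_emb: (B, Da), face_emb: (B, Df), face_visible: (B,) in {0, 1}
        a = self.audio_proj(audio_emb)
        v = self.face_proj(face_emb) * face_visible.unsqueeze(-1)
        return self.scorer(torch.cat([a, v], dim=-1))   # same-speaker score
```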
arXiv Detail & Related papers (2021-11-29T11:02:41Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
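A spatial dynamic-stream-weight fusion can be pictured as giving every region its own audio/video weight before combining the two likelihood maps; the toy NumPy example below uses a simple linear combination, which may differ from the paper's formulation.

```python
import numpy as np

def fuse_localization(audio_map, video_map, stream_w):
    """Toy spatial dynamic-stream-weight fusion.

    audio_map, video_map: (R,) per-region speaker-position likelihoods.
    stream_w: (R,) weight of the audio stream in [0, 1] for every region,
              e.g. lower where the camera view is reliable.
    """
    fused = stream_w * audio_map + (1.0 - stream_w) * video_map
    return int(np.argmax(fused))      # index of the estimated speaker region

# Example: video dominates regions 0-1, audio dominates region 2.
audio = np.array([0.2, 0.3, 0.9])
video = np.array([0.7, 0.1, 0.2])
w = np.array([0.3, 0.3, 0.8])
print(fuse_localization(audio, video, w))   # -> 2
```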
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
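The discrete-representation idea can be approximated by quantizing frame-level features of untranscribed audio into unit IDs that stand in for text during the unsupervised phase; the sketch below uses k-means as a stand-in quantizer, an assumption rather than the paper's own quantization scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def discretize_speech(features, n_units=128, seed=0):
    """Toy discrete speech representation: map frame-level features of
    untranscribed audio to unit IDs usable as pseudo-inputs by an
    encoder-decoder TTS model.

    features: (n_frames, feat_dim), e.g. mel or self-supervised features.
    Returns (unit_ids, fitted_kmeans).
    """
    km = KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(features)
    return km.labels_, km

# Usage idea: unit sequences from untranscribed multi-speaker data supplement
# the limited paired text-audio data when training the TTS model.
```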
arXiv Detail & Related papers (2020-05-16T15:47:11Z)