RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
- URL: http://arxiv.org/abs/2407.19224v2
- Date: Tue, 30 Jul 2024 02:27:56 GMT
- Title: RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
- Authors: Tianrui Pan, Jie Liu, Bohan Wang, Jie Tang, Gangshan Wu
- Abstract summary: We propose a simultaneous multi-speaker separation framework that separates multiple speakers concurrently within a single process.
Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers.
- Score: 45.095482324156606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While existing Audio-Visual Speech Separation (AVSS) methods primarily concentrate on the audio-visual fusion strategy for two-speaker separation, they suffer a severe performance drop in multi-speaker separation scenarios. Typically, AVSS methods employ guiding videos to sequentially isolate individual speakers from the given audio mixture, resulting in notable missing and noisy parts across various segments of the separated speech. In this study, we propose a simultaneous multi-speaker separation framework that enables the concurrent separation of multiple speakers within a single process. We introduce speaker-wise interactions to establish distinctions and correlations among speakers. Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers. Additionally, our model can utilize speakers with complete audio-visual information to compensate for speakers with deficient visual cues, thereby enhancing its resilience to missing visual information. We also conduct experiments where visual information for specific speakers is entirely absent or visual frames are partially missing. The results demonstrate that our model consistently outperforms others, exhibiting the smallest performance drop across all settings involving 2, 3, 4, and 5 speakers.
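The abstract describes speaker-wise interactions among per-speaker separation streams but includes no code. The following is a minimal sketch of one way such an interaction could be realized, with self-attention applied across the speaker axis; the module name, tensor shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeakerInteraction(nn.Module):
    """Minimal sketch: let per-speaker separation features attend to each
    other so the model can learn distinctions and correlations among
    speakers. Shapes and hyperparameters are assumptions for illustration."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, num_speakers, time, dim) -- one feature stream per speaker
        b, s, t, d = x.shape
        # Attend across the speaker axis independently at every time step.
        y = x.permute(0, 2, 1, 3).reshape(b * t, s, d)   # (b*t, speakers, dim)
        y, _ = self.attn(y, y, y)                        # speaker-wise interaction
        y = y.reshape(b, t, s, d).permute(0, 2, 1, 3)
        return self.norm(x + y)                          # residual + norm

# Usage: 2 speakers, 100 frames, 256-dim features
feats = torch.randn(1, 2, 100, 256)
out = SpeakerInteraction()(feats)   # same shape, speaker streams now share context
```

Because every speaker stream attends to every other stream, information from speakers with complete audio-visual cues can in principle flow to visually deficient ones, which is the robustness property the abstract highlights.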
Related papers
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences across multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z) - Late Audio-Visual Fusion for In-The-Wild Speaker Diarization [33.0046568984949]
We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion.
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle larger numbers of speakers (a minimal sketch of attractor-based decoding appears after this list).
arXiv Detail & Related papers (2022-11-02T17:20:42Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z) - Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to determine the number of speakers in the scene and use visual information to avoid the permutation problem (see the permutation-invariant training sketch after this list).
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z) - FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z) - Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices over multiple processing steps, while keeping the speaker assigned to each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z) - Self-supervised learning for audio-visual speaker diarization [33.87232473483064]
We propose a self-supervised audio-video synchronization learning method to address the problem of speaker diarization without massive labeling effort.
We test them on a real-world human-computer interaction system, and the results show our best model yields a remarkable gain of +8% F1-score as well as a reduction in diarization error rate.
arXiv Detail & Related papers (2020-02-13T02:36:32Z)
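For the attractor-based diarization entry above (EEND-EDA / EEND-EDA++), the core decoding step can be summarized compactly: the activity of each speaker at each frame is the sigmoid of the dot product between that frame's embedding and the speaker's attractor. The sketch below illustrates only this standard formulation, not the paper's EEND-EDA++ additions (attention in decoding, speaker recognition loss); the shapes and names are assumptions.

```python
import torch

def attractor_activities(frame_emb, attractors):
    """Attractor-based diarization decoding sketch: speaker activity at
    frame t for speaker s is sigmoid(<frame embedding, attractor>).
    Assumed shapes: frame_emb (T, D), attractors (S, D) -> posteriors (T, S)."""
    return torch.sigmoid(frame_emb @ attractors.T)

# Usage: 500 frames, 256-dim embeddings, 3 attractors (3 detected speakers)
emb = torch.randn(500, 256)
att = torch.randn(3, 256)
activity = attractor_activities(emb, att)   # (500, 3), values in (0, 1)
```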
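The "permutation problem" mentioned in the adversarially disentangled audio-visual separation entry arises because an audio-only separator does not know which output channel corresponds to which reference speaker; permutation-invariant training (PIT) resolves it by scoring every speaker ordering and keeping the best one. The sketch below is a generic PIT loss for illustration, not code from any of the listed papers; visual guidance fixes the output ordering and makes this search unnecessary.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_mse_loss(estimates, targets):
    """Permutation-invariant training (PIT) loss sketch.
    estimates, targets: (batch, num_speakers, samples).
    The loss is computed under every speaker ordering, and the best
    ordering is kept per example."""
    b, s, _ = estimates.shape
    losses = []
    for perm in itertools.permutations(range(s)):
        # MSE between estimates and targets under this speaker ordering
        l = F.mse_loss(estimates, targets[:, list(perm)], reduction="none")
        losses.append(l.mean(dim=(1, 2)))          # (batch,)
    losses = torch.stack(losses, dim=1)            # (batch, num_permutations)
    return losses.min(dim=1).values.mean()         # best ordering per example

# Usage: 2 estimated and 2 reference sources of 16000 samples each
est = torch.randn(4, 2, 16000)
ref = torch.randn(4, 2, 16000)
loss = pit_mse_loss(est, ref)
```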