Audio-visual Speech Separation with Adversarially Disentangled Visual
Representation
- URL: http://arxiv.org/abs/2011.14334v1
- Date: Sun, 29 Nov 2020 10:48:42 GMT
- Title: Audio-visual Speech Separation with Adversarially Disentangled Visual
Representation
- Authors: Peng Zhang, Jiaming Xu, Jing Shi, Yunzhe Hao, Bo Xu
- Abstract summary: Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
- Score: 23.38624506211003
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech separation aims to separate individual voices from an audio
mixture of multiple simultaneous talkers. Although audio-only approaches
achieve satisfactory performance, they rely on strategies tailored to
predefined conditions, which limits their application in complex auditory
scenes. Towards the cocktail party problem, we propose a novel audio-visual
speech separation model. In our model, we use a face detector to determine the
number of speakers in the scene and use visual information to avoid the
permutation problem. To improve the model's generalization to unknown
speakers, we explicitly extract speech-related visual features from the visual
inputs with an adversarial disentanglement method, and use these features to
assist speech separation. In addition, a time-domain approach is adopted,
which avoids the phase reconstruction problem present in time-frequency domain
models. To compare our model's performance with other models, we create two
benchmark datasets of 2-speaker mixtures from the GRID and TCDTIMIT
audio-visual datasets. Through a series of experiments, our proposed model is
shown to outperform the state-of-the-art audio-only model and three
audio-visual models.
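To make the pipeline described in the abstract concrete (a learned waveform encoder/decoder in place of an STFT, one visual embedding per detected face, and visually conditioned masks), here is a minimal PyTorch sketch. It assumes the face detector and the adversarially disentangled front-end already yield one fixed-size embedding per speaker; the module names, dimensions, and fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    """Minimal time-domain audio-visual separator (illustrative sketch, not the paper's exact model)."""
    def __init__(self, enc_dim=256, kernel=16, stride=8, visual_dim=512):
        super().__init__()
        # 1-D conv encoder/decoder replace the STFT, so no phase reconstruction is needed.
        self.encoder = nn.Conv1d(1, enc_dim, kernel_size=kernel, stride=stride, bias=False)
        self.decoder = nn.ConvTranspose1d(enc_dim, 1, kernel_size=kernel, stride=stride, bias=False)
        # Project the (adversarially disentangled) visual embedding into the audio feature space.
        self.visual_proj = nn.Linear(visual_dim, enc_dim)
        # Mask estimator conditioned on the fused audio-visual representation.
        self.mask_net = nn.Sequential(
            nn.Conv1d(2 * enc_dim, enc_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(enc_dim, enc_dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, mixture, visual_feats):
        # mixture:      (batch, samples)              raw waveform of the mixed speech
        # visual_feats: (batch, speakers, visual_dim) one embedding per detected face
        mix = self.encoder(mixture.unsqueeze(1))           # (B, C, T)
        outputs = []
        for spk in range(visual_feats.size(1)):            # one pass per detected speaker
            v = self.visual_proj(visual_feats[:, spk])     # (B, C)
            v = v.unsqueeze(-1).expand(-1, -1, mix.size(-1))
            mask = self.mask_net(torch.cat([mix, v], dim=1))
            outputs.append(self.decoder(mix * mask).squeeze(1))
        return torch.stack(outputs, dim=1)                 # (B, speakers, samples)
```

Because each separated channel is conditioned on one detected face, the output order is fixed by the visual stream, which is how visual information sidesteps the label permutation problem of audio-only training.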
Related papers
- RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues [45.095482324156606]
We propose a multi-speaker separation framework that separates all speakers in a mixture concurrently.
Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers.
arXiv Detail & Related papers (2024-07-27T09:56:23Z)
- Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach [3.89476785897726]
We introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features.
Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted.
arXiv Detail & Related papers (2024-06-02T23:51:43Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model [13.96610874947899]
We propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known for its capability in generating natural samples.
For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism.
We demonstrate that the proposed framework achieves state-of-the-art results on two benchmarks, including VoxCeleb2 and LRS3, producing speech with notably better naturalness.
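The cross-attention fusion mentioned above can be sketched in a few lines: audio frames act as queries over the visual frame sequence, and a residual path keeps the audio stream intact. The dimensions, head count, and module names are assumptions for illustration, not AVDiffuSS internals.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion: audio frames attend to visual frames."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio:  (batch, audio_frames, dim)  query
        # visual: (batch, video_frames, dim)  key / value
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)       # residual connection preserves the audio stream

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 200, 256), torch.randn(2, 50, 256))  # -> (2, 200, 256)
```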
arXiv Detail & Related papers (2023-10-30T14:39:34Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
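The general shape of such a token-to-token model can be sketched with torch.nn.Transformer over discrete speech codes. The vocabulary size, layer count, and tokenizer are placeholders rather than TokenSplit's configuration, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TokenSeq2Seq(nn.Module):
    """Sketch of a Transformer encoder-decoder over discrete speech tokens (not the TokenSplit model)."""
    def __init__(self, vocab=1024, dim=512, layers=6):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, dim)   # discrete codes of the input mixture
        self.tgt_emb = nn.Embedding(vocab, dim)   # discrete codes of one separated source
        self.transformer = nn.Transformer(d_model=dim, num_encoder_layers=layers,
                                          num_decoder_layers=layers, batch_first=True)
        self.head = nn.Linear(dim, vocab)         # predicts the next target token

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens: (batch, src_len); tgt_tokens: (batch, tgt_len), shifted right for teacher forcing.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt_tokens.device)
        out = self.transformer(self.src_emb(src_tokens), self.tgt_emb(tgt_tokens), tgt_mask=tgt_mask)
        return self.head(out)                     # (batch, tgt_len, vocab) logits
```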
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these additions can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
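A small sketch of the general recipe of injecting visual tokens and lightweight adapters into a frozen audio encoder layer is given below; the backbone choice, adapter bottleneck size, and visual feature dimension are assumptions, not AVFormer's actual design.

```python
import torch
import torch.nn as nn

class VisualAdapterLayer(nn.Module):
    """Wraps a frozen encoder layer; only the visual projection and adapter are trained (illustrative)."""
    def __init__(self, frozen_layer, dim=512, visual_dim=768, bottleneck=64):
        super().__init__()
        self.frozen_layer = frozen_layer
        for p in self.frozen_layer.parameters():
            p.requires_grad = False                       # keep the speech backbone frozen
        self.visual_proj = nn.Linear(visual_dim, dim)     # maps visual frame features to token space
        self.adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, dim))

    def forward(self, audio_states, visual_tokens):
        # audio_states:  (batch, audio_len, dim)       hidden states of the frozen audio model
        # visual_tokens: (batch, n_frames, visual_dim) features from a pretrained image encoder
        x = torch.cat([self.visual_proj(visual_tokens), audio_states], dim=1)  # prepend visual tokens
        x = self.frozen_layer(x)
        return x + self.adapter(x)                        # residual adapter: cheap domain adaptation

layer = VisualAdapterLayer(nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True))
out = layer(torch.randn(2, 100, 512), torch.randn(2, 8, 768))   # -> (2, 108, 512)
```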
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
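Dynamic stream weighting can be illustrated with a small module that predicts a time-varying weight from reliability features of both streams and blends the per-stream localization scores accordingly. The shapes and the weight network below are assumptions for illustration, not the paper's exact formulation (which assigns weights per spatial region).

```python
import torch
import torch.nn as nn

class DynamicStreamFusion(nn.Module):
    """Illustrative dynamic stream weighting for audiovisual speaker localization."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # Predicts a weight in [0, 1] per frame from reliability features of both streams.
        self.weight_net = nn.Sequential(nn.Linear(2 * feat_dim, 32), nn.ReLU(),
                                        nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, audio_scores, video_scores, audio_feat, video_feat):
        # audio_scores / video_scores: (batch, frames, directions) localization scores per stream
        # audio_feat / video_feat:     (batch, frames, feat_dim)   stream reliability features
        w = self.weight_net(torch.cat([audio_feat, video_feat], dim=-1))   # (batch, frames, 1)
        return w * audio_scores + (1.0 - w) * video_scores                 # fused localization scores
```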
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
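The hybrid CTC/Attention objective interpolates a CTC loss on the encoder outputs with a cross-entropy loss on the attention decoder. A minimal sketch of that combination follows; the interpolation weight and the padding/blank conventions are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_logits, dec_logits, targets,
                              input_lengths, target_lengths,
                              ctc_weight=0.3, pad_id=0, blank_id=0):
    """Illustrative hybrid objective: ctc_weight * CTC + (1 - ctc_weight) * attention cross-entropy."""
    # ctc_logits: (time, batch, vocab)    frame-level encoder outputs
    # dec_logits: (batch, tgt_len, vocab) attention-decoder outputs under teacher forcing
    # targets:    (batch, tgt_len)        label ids padded with pad_id; real labels never use blank_id
    ctc = F.ctc_loss(ctc_logits.log_softmax(-1), targets, input_lengths,
                     target_lengths, blank=blank_id, zero_infinity=True)
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets, ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```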
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.