Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning
- URL: http://arxiv.org/abs/2303.03737v1
- Date: Tue, 7 Mar 2023 08:53:20 GMT
- Title: Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning
- Authors: Zhaoxi Mu, Xinyu Yang, Wenjing Zhu
- Abstract summary: Proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and at multiple scales.
- Score: 9.84949849886926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has shown advanced performance in speech separation, benefiting
from its ability to capture global features. However, capturing local features
and channel information of audio sequences in speech separation is equally
important. In this paper, we present a novel approach named Intra-SE-Conformer
and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a
new network SE-Conformer that can model audio sequences in multiple dimensions
and scales, and apply it to the dual-path speech separation framework.
Furthermore, we propose Multi-Block Feature Aggregation, which improves
separation performance by selectively utilizing information from the intermediate
blocks of the separation network. In addition, we propose a speaker-similarity
discriminative loss that optimizes the separation model for the case where
performance degrades because the speakers have similar voices. Experimental
results on the benchmark datasets WSJ0-2mix and WHAM! show that ISCIT
achieves state-of-the-art results.
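The abstract emphasizes that channel information of audio sequences matters alongside local and global features, and the name SE-Conformer suggests a squeeze-and-excitation (SE) style channel gate inside each conformer block. Below is a minimal sketch of such a channel gate for sequence features; the module name `ChannelSE`, the reduction factor, and where the gate sits inside the block are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a squeeze-and-excitation (SE) channel gate for audio
# feature sequences, the general mechanism implied by the name "SE-Conformer".
# How the paper integrates it into the conformer block is not stated in the
# abstract; names and hyperparameters here are assumptions.
import torch
import torch.nn as nn


class ChannelSE(nn.Module):
    """Squeeze-and-excitation gate for features shaped [batch, channels, time]."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=-1)            # squeeze: per-channel mean over time -> [B, C]
        g = self.fc(s).unsqueeze(-1)  # excite: per-channel gate in (0, 1) -> [B, C, 1]
        return x * g                  # re-scale the sequence channel-wise


if __name__ == "__main__":
    feats = torch.randn(4, 256, 200)       # [batch, channels, time]
    print(ChannelSE(256)(feats).shape)     # torch.Size([4, 256, 200])
```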
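The speaker-similarity discriminative loss behind the title's "Optimized by Discriminative Learning" is described only at a high level in the abstract. One plausible reading, sketched below, combines the standard negative SI-SNR separation objective with a penalty on how similar the speaker embeddings of the two separated outputs remain; the helper `spk_embed`, the weight `alpha`, and the cosine-similarity form of the penalty are assumptions for illustration, not the authors' formulation.

```python
# Hedged sketch of a training objective in the spirit of the paper's
# "speaker similarity discriminative loss": negative SI-SNR plus a penalty on
# the similarity of speaker embeddings of the separated outputs. The exact
# formulation is not given in the abstract; spk_embed, alpha, and the cosine
# penalty are illustrative assumptions.
import torch
import torch.nn.functional as F


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for waveforms shaped [batch, time]."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def discriminative_loss(est1, est2, ref1, ref2, spk_embed, alpha: float = 0.1):
    """Negative SI-SNR plus a speaker-similarity penalty (illustrative only).

    spk_embed is a hypothetical speaker encoder mapping [batch, time]
    waveforms to [batch, dim] embeddings; permutation-invariant assignment
    of estimates to references is omitted for brevity.
    """
    sep = -(si_snr(est1, ref1) + si_snr(est2, ref2)).mean() / 2
    # Penalize separated signals whose speaker embeddings stay too similar,
    # the failure mode the paper targets for similar-sounding speakers.
    sim = F.cosine_similarity(spk_embed(est1), spk_embed(est2), dim=-1)
    return sep + alpha * sim.clamp(min=0).mean()
```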
Related papers
- RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues [45.095482324156606]
We propose a simultaneous multi-speaker separation framework that can facilitate the concurrent separation of multiple speakers.
Experimental results on the VoxCeleb2 and LRS3 datasets demonstrate that our method achieves state-of-the-art performance in separating mixtures with 2, 3, 4, and 5 speakers.
arXiv Detail & Related papers (2024-07-27T09:56:23Z)
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z)
- Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model [0.0]
"Monaural multi-speaker speech separation" presents a speech-separation model based on the Transformer architecture and its efficient forms.
The model has been trained with the LibriMix dataset containing diverse speakers' utterances.
arXiv Detail & Related papers (2023-07-29T15:10:46Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech mixtures constructed by simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to determine the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer architectures in lieu of recurrent neural networks in the separation system.
We believe capturing global information with self-attention based methods is crucial for speech separation (a simplified sketch of such a conformer block follows this entry).
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
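As a rough illustration of the conformer block referenced in the entry above, the sketch below pairs multi-head self-attention (global context) with a depthwise convolution module (local patterns). The dimensions, activation choices, and omissions such as relative positional encoding are assumptions, not that paper's exact configuration.

```python
# Simplified conformer-style block: half-step feed-forward, multi-head
# self-attention (global context), depthwise convolution (local patterns),
# another half-step feed-forward. Not the exact configuration of the cited
# paper; hyperparameters are assumptions.
import torch
import torch.nn as nn


class ConformerBlockSketch(nn.Module):
    """Operates on sequences shaped [batch, time, dim]."""

    def __init__(self, dim: int = 256, heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.BatchNorm1d(dim), nn.SiLU(), nn.Conv1d(dim, dim, 1))
        self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ff1(x)                              # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]      # global context
        c = self.conv_norm(x).transpose(1, 2)                  # [B, dim, T] for Conv1d
        x = x + self.conv(c).transpose(1, 2)                   # local patterns
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)


if __name__ == "__main__":
    frames = torch.randn(2, 100, 256)                 # [batch, frames, features]
    print(ConformerBlockSketch()(frames).shape)       # torch.Size([2, 100, 256])
```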