On Time Domain Conformer Models for Monaural Speech Separation in Noisy
Reverberant Acoustic Environments
- URL: http://arxiv.org/abs/2310.06125v1
- Date: Mon, 9 Oct 2023 20:02:11 GMT
- Title: On Time Domain Conformer Models for Monaural Speech Separation in Noisy
Reverberant Acoustic Environments
- Authors: William Ravenscroft and Stefan Goetze and Thomas Hain
- Abstract summary: Time domain conformers (TD-Conformers) are an analogue of the dual-path (DP) approach in that they also process local and global context sequentially.
The best TD-Conformer achieves 14.6 dB and 21.2 dB SISDR improvement on the WHAMR and WSJ0-2Mix benchmarks, respectively.
- Score: 20.592466025674643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech separation remains an important topic for multi-speaker technology
researchers. Convolution augmented transformers (conformers) have performed
well for many speech processing tasks but have been under-researched for speech
separation. Most recent state-of-the-art (SOTA) separation models have been
time-domain audio separation networks (TasNets). A number of successful models
have made use of dual-path (DP) networks which sequentially process local and
global information. Time domain conformers (TD-Conformers) are an analogue of
the DP approach in that they also process local and global context sequentially
but have a different time complexity function. It is shown that for realistic
shorter signal lengths, conformers are more efficient when controlling for
feature dimension. Subsampling layers are proposed to further improve
computational efficiency. The best TD-Conformer achieves 14.6 dB and 21.2 dB
SISDR improvement on the WHAMR and WSJ0-2Mix benchmarks, respectively.
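To make the approach described above more concrete, the following is a minimal PyTorch-style sketch of a conformer-type block operating on encoded time-domain features: global context is modelled by self-attention over a subsampled sequence and local context by a depthwise convolution module. The module sizes, subsampling factor, and sub-layer ordering are illustrative assumptions, not the authors' exact TD-Conformer architecture.

    import torch
    import torch.nn as nn

    class ConvModule(nn.Module):
        """Local-context module: pointwise conv + GLU, depthwise conv, pointwise conv."""
        def __init__(self, dim, kernel_size=31):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
            self.glu = nn.GLU(dim=1)
            self.dw = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
            self.act = nn.SiLU()
            self.pw2 = nn.Conv1d(dim, dim, 1)

        def forward(self, x):                  # x: (batch, time, dim)
            y = self.norm(x).transpose(1, 2)   # -> (batch, dim, time) for Conv1d
            y = self.glu(self.pw1(y))
            y = self.pw2(self.act(self.dw(y)))
            return x + y.transpose(1, 2)       # residual connection

    class TDConformerBlockSketch(nn.Module):
        """Illustrative block: subsample, self-attention (global), upsample, conv (local)."""
        def __init__(self, dim=256, heads=4, subsample=4):
            super().__init__()
            # strided conv as a subsampling layer to shorten the attended sequence
            self.down = nn.Conv1d(dim, dim, subsample, stride=subsample)
            self.up = nn.ConvTranspose1d(dim, dim, subsample, stride=subsample)
            self.attn_norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.conv = ConvModule(dim)

        def forward(self, x):  # x: (batch, time, dim); time assumed divisible by subsample
            z = self.down(x.transpose(1, 2)).transpose(1, 2)  # subsampled features
            z = self.attn_norm(z)
            z, _ = self.attn(z, z, z)                         # global context
            z = self.up(z.transpose(1, 2)).transpose(1, 2)    # restore original length
            x = x + z                                         # residual connection
            return self.conv(x)                               # local context

    feats = torch.randn(2, 800, 256)              # (batch, encoded frames, feature dim)
    print(TDConformerBlockSketch()(feats).shape)  # torch.Size([2, 800, 256])

In a sketch like this, subsampling the sequence by a factor S cuts the quadratic self-attention cost from roughly O(T^2) to O((T/S)^2), which illustrates how subsampling layers can improve computational efficiency.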
Related papers
- DPATD: Dual-Phase Audio Transformer for Denoising [25.097894984130733]
We propose a dual-phase audio transformer for denoising (DPATD), a novel model to organize transformer layers in a deep structure to learn clean audio sequences for denoising.
Our memory-compressed explainable attention is efficient and converges faster than the frequently used self-attention module.
arXiv Detail & Related papers (2023-10-30T14:44:59Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
This paper proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation [26.94528951545861]
Speech separation models are used for isolating individual speakers in many speech processing applications.
Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks.
One such class of models known as temporal convolutional networks (TCNs) has shown promising results for speech separation tasks.
Recent research in speech dereverberation has shown that the optimal receptive field (RF) of a TCN varies with the reverberation characteristics of the speech signal (a small RF calculation is sketched after this list).
arXiv Detail & Related papers (2022-10-27T10:29:19Z)
- End-To-End Audiovisual Feature Fusion for Active Speaker Detection [7.631698269792165]
This work presents a novel two-stream end-to-end framework fusing features extracted from images via VGG-M with raw Mel Frequency Cepstrum Coefficients features extracted from the audio waveform.
Our best-performing model attained 88.929% accuracy, nearly the same detection result as the state-of-the-art.
arXiv Detail & Related papers (2022-07-27T10:25:59Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization [73.62550438861942]
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR).
In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance.
arXiv Detail & Related papers (2020-10-30T20:26:28Z)
- Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations (a minimal channel-recalibration sketch follows this list).
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
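For the deformable TCN entry above, the receptive field (RF) in question is the fixed span of input frames a standard dilated TCN can see; the small calculation below makes that concrete. The kernel size, layer count, and stack count are assumed Conv-TasNet-style values for illustration, not taken from the paper.

    def tcn_receptive_field(kernel_size: int, layers_per_stack: int, stacks: int) -> int:
        """Receptive field (in encoder frames) of a TCN whose dilation doubles
        per layer (1, 2, 4, ...) within each repeated stack of conv blocks."""
        rf = 1
        for _ in range(stacks):
            for layer in range(layers_per_stack):
                rf += (kernel_size - 1) * (2 ** layer)
        return rf

    # Assumed configuration: kernel 3, 8 layers per stack, 3 stacks.
    print(tcn_receptive_field(kernel_size=3, layers_per_stack=8, stacks=3))  # 1531 frames

Because this RF is fixed once the architecture is chosen, it cannot adapt to inputs whose reverberation characteristics call for more or less temporal context, which is presumably the mismatch the deformable variant targets.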
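The global-context channel recalibration described in the "Global Context Guided Channel and Time-Frequency Transformations" entry above can be illustrated with a squeeze-and-excitation-style gate. This is a generic sketch of the idea, not the paper's exact module; the tensor shapes and reduction factor are assumptions.

    import torch
    import torch.nn as nn

    class GlobalContextChannelGate(nn.Module):
        """Squeeze-and-excitation-style gate: pool global context over frequency and
        time, then rescale each channel of a (batch, channels, freq, time) map."""
        def __init__(self, channels: int, reduction: int = 8):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, x):                     # x: (batch, channels, freq, time)
            context = x.mean(dim=(2, 3))          # global average pool -> (batch, channels)
            gate = self.fc(context)[:, :, None, None]
            return x * gate                       # channels recalibrated by global context

    fmap = torch.randn(4, 64, 80, 200)               # e.g. a 64-channel time-frequency feature map
    print(GlobalContextChannelGate(64)(fmap).shape)  # torch.Size([4, 64, 80, 200])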