SpEx: Multi-Scale Time Domain Speaker Extraction Network
- URL: http://arxiv.org/abs/2004.08326v1
- Date: Fri, 17 Apr 2020 16:13:06 GMT
- Title: SpEx: Multi-Scale Time Domain Speaker Extraction Network
- Authors: Chenglin Xu, Wei Rao, Eng Siong Chng and Haizhou Li
- Abstract summary: Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment.
It is common to perform the extraction in the frequency domain and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.
We propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
- Score: 89.00319878262005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker extraction aims to mimic humans' selective auditory attention by
extracting a target speaker's voice from a multi-talker environment. It is
common to perform the extraction in the frequency domain and reconstruct the
time-domain signal from the extracted magnitude and estimated phase spectra.
However, such an approach is adversely affected by the inherent difficulty of
phase estimation. Inspired by Conv-TasNet, we propose a time-domain speaker
extraction network (SpEx) that converts the mixture speech into multi-scale
embedding coefficients instead of decomposing the speech signal into magnitude
and phase spectra. In this way, we avoid phase estimation. The SpEx network
consists of four network components, namely speaker encoder, speech encoder,
speaker extractor, and speech decoder. Specifically, the speech encoder
converts the mixture speech into multi-scale embedding coefficients, while the
speaker encoder learns to represent the target speaker with a speaker
embedding. The speaker extractor takes the multi-scale embedding coefficients
and target speaker embedding as input and estimates a receptive mask. Finally,
the speech decoder reconstructs the target speaker's speech from the masked
embedding coefficients. We also propose a multi-task learning framework and a
multi-scale embedding implementation. Experimental results show that the
proposed SpEx achieves 37.3%, 37.7% and 15.0% relative improvements over the
best baseline in terms of signal-to-distortion ratio (SDR), scale-invariant SDR
(SI-SDR), and perceptual evaluation of speech quality (PESQ) under an open
evaluation condition.
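
The four-component pipeline described in the abstract (speech encoder, speaker encoder, speaker extractor, speech decoder) can be pictured with a minimal PyTorch sketch. The layer sizes, window lengths, and module internals below are illustrative assumptions, not the authors' configuration; the actual SpEx extractor uses stacked temporal convolutional blocks, and training adds the multi-task objective mentioned above.

```python
# Minimal, illustrative sketch of the four SpEx components named in the abstract.
# All hyperparameters and module internals are assumptions for illustration only.
import torch
import torch.nn as nn


class SpExSketch(nn.Module):
    def __init__(self, n_filters=256, win_lengths=(20, 80, 160), spk_dim=256):
        super().__init__()
        stride = win_lengths[0] // 2  # shared hop keeps the scales frame-aligned
        # Speech encoder: parallel 1-D convolutions with short/middle/long windows
        # produce the multi-scale embedding coefficients.
        self.encoders = nn.ModuleList(
            [nn.Conv1d(1, n_filters, kernel_size=w, stride=stride) for w in win_lengths]
        )
        # Speaker encoder: maps a reference utterance to a single speaker embedding.
        self.spk_net = nn.Sequential(
            nn.Conv1d(1, spk_dim, kernel_size=win_lengths[0], stride=stride),
            nn.ReLU(),
        )
        # Speaker extractor: conditioned on the speaker embedding, it estimates
        # one receptive mask per scale.
        self.extractor = nn.Sequential(
            nn.Conv1d(len(win_lengths) * n_filters + spk_dim, n_filters, 1),
            nn.ReLU(),
            nn.Conv1d(n_filters, len(win_lengths) * n_filters, 1),
            nn.Sigmoid(),
        )
        # Speech decoder: transposed convolutions map masked coefficients back to
        # waveforms, one per scale.
        self.decoders = nn.ModuleList(
            [nn.ConvTranspose1d(n_filters, 1, kernel_size=w, stride=stride) for w in win_lengths]
        )

    def forward(self, mixture, reference):
        # mixture, reference: (batch, samples)
        mix, ref = mixture.unsqueeze(1), reference.unsqueeze(1)
        # Multi-scale embedding coefficients, trimmed to a common frame count.
        feats = [torch.relu(enc(mix)) for enc in self.encoders]
        n_frames = min(f.shape[-1] for f in feats)
        feats = [f[..., :n_frames] for f in feats]
        stacked = torch.cat(feats, dim=1)
        # Speaker embedding: mean-pool the reference over time, repeat per frame.
        spk = self.spk_net(ref).mean(dim=-1, keepdim=True).expand(-1, -1, n_frames)
        # Estimate masks and apply them scale by scale.
        masks = self.extractor(torch.cat([stacked, spk], dim=1))
        masks = masks.chunk(len(self.encoders), dim=1)
        outs = [dec(m * f) for dec, m, f in zip(self.decoders, masks, feats)]
        # Sum the per-scale reconstructions, trimmed to the shortest one.
        n_samples = min(o.shape[-1] for o in outs)
        return sum(o[..., :n_samples] for o in outs).squeeze(1)


if __name__ == "__main__":
    model = SpExSketch()
    mix = torch.randn(2, 16000)   # 1 s of 16 kHz mixture audio
    ref = torch.randn(2, 32000)   # 2 s reference of the target speaker
    print(model(mix, ref).shape)  # slightly shorter than the input due to framing
```

The shared encoder stride in this sketch is what keeps the per-scale coefficients frame-aligned before masking, which is the practical point of the multi-scale embedding described above.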
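
The reported gains are measured in SDR, SI-SDR, and PESQ. As a reference point, the snippet below computes SI-SDR in its standard form (projecting the estimate onto the target to find the optimal scaling before measuring distortion); it illustrates the metric only and is not the paper's evaluation code.

```python
# Scale-invariant SDR (SI-SDR) in its common formulation, shown only to make the
# reported metric concrete.
import numpy as np


def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """SI-SDR in dB between an estimated and a reference waveform."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target obtained by projecting the estimate onto it.
    alpha = np.dot(estimate, target) / np.dot(target, target)
    projection = alpha * target
    noise = estimate - projection
    return 10.0 * np.log10(np.sum(projection ** 2) / np.sum(noise ** 2))


if __name__ == "__main__":
    t = np.random.randn(16000)
    e = 0.8 * t + 0.05 * np.random.randn(16000)
    print(f"SI-SDR: {si_sdr(e, t):.2f} dB")
```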
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845] (arXiv, 2024-06-13)
  We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
  The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
  Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003] (arXiv, 2024-05-30)
  We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
  For that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
- LocSelect: Target Speaker Localization with an Auditory Selective Hearing Mechanism [45.90677498529653] (arXiv, 2023-10-16)
  We present a target speaker localization algorithm with a selective hearing mechanism.
  Our proposed network LocSelect achieves a mean absolute error (MAE) of 3.55 and an accuracy (ACC) of 87.40%.
- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0] (arXiv, 2023-06-07)
  We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
  Experiments on the AMI meeting corpus show that CH-DOA improves the segmentation while remaining robust when microphones are deactivated.
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE [8.144263449781967] (arXiv, 2022-03-30)
  A variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker-identity and linguistic-content latent embeddings.
  In this work, we find a suitable location in the VAE decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance.
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118] (arXiv, 2022-03-18)
  We reformulate overlapped speech diarization as a single-label prediction problem.
  We propose the speaker embedding-aware neural diarization (SEND) system.
- Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation [7.453268060082337] (arXiv, 2020-12-01)
  We propose deep ad-hoc beamforming based on speaker extraction, which is, to our knowledge, the first work on target-dependent speech separation with ad-hoc microphone arrays and deep learning.
  Experimental results demonstrate the effectiveness of the proposed method.
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777] (arXiv, 2020-01-23)
  SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
  SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
  We show experimentally that the proposed strategies greatly improve speech extraction performance, especially for same-gender mixtures.
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113] (arXiv, 2020-01-02)
  Target speech separation refers to extracting the target speaker's speech from mixed signals.
  Two main challenges are the complex acoustic environment and the real-time processing requirement.
  We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
This list is automatically generated from the titles and abstracts of the papers on this site.