Speakerfilter-Pro: an improved target speaker extractor combines the
time domain and frequency domain
- URL: http://arxiv.org/abs/2010.13053v1
- Date: Sun, 25 Oct 2020 07:30:30 GMT
- Title: Speakerfilter-Pro: an improved target speaker extractor combines the
time domain and frequency domain
- Authors: Shulin He, Hao Li, Xueliang Zhang
- Abstract summary: This paper introduces an improved target speaker extractor, referred to as Speakerfilter-Pro, based on our previous Speakerfilter model.
The Speakerfilter uses a bi-directional gated recurrent unit (BGRU) module to characterize the target speaker from anchor speech and a convolutional recurrent network (CRN) module to separate the target speech from a noisy signal.
WaveUNet has been shown to perform speech separation well in the time domain.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces an improved target speaker extractor, referred to as
Speakerfilter-Pro, based on our previous Speakerfilter model. The Speakerfilter
uses a bi-directional gated recurrent unit (BGRU) module to characterize the
target speaker from anchor speech and a convolutional recurrent network (CRN)
module to separate the target speech from a noisy signal. Different from the
Speakerfilter, the Speakerfilter-Pro adds a WaveUNet module at the beginning
and at the end, respectively. WaveUNet has been shown to perform speech
separation well in the time domain. To better extract the target speaker
information, the complex spectrum, instead of the magnitude spectrum, is used
as the input feature for the CRN module. Experiments are conducted on the
two-speaker dataset (WSJ0-2mix), which is widely used for speaker extraction.
The systematic evaluation shows that Speakerfilter-Pro outperforms the
Speakerfilter and other baselines, achieving a signal-to-distortion ratio
(SDR) of 14.95 dB.
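To make the described pipeline concrete, below is a minimal PyTorch sketch of the data flow: a time-domain front end standing in for the first WaveUNet, a BGRU speaker encoder over the anchor speech, a recurrent module standing in for the CRN that estimates a complex-domain mask from the complex spectrum plus the speaker embedding, and a time-domain back end standing in for the second WaveUNet. All module sizes, the concatenation-based fusion, and the mask formulation are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the described pipeline; sizes and fusion are assumptions.
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
F_BINS = N_FFT // 2 + 1

def stft(x):
    win = torch.hann_window(N_FFT, device=x.device)
    return torch.stft(x, N_FFT, HOP, window=win, return_complex=True)

def istft(spec, length):
    win = torch.hann_window(N_FFT, device=spec.device)
    return torch.istft(spec, N_FFT, HOP, window=win, length=length)

class SpeakerEncoder(nn.Module):
    """BGRU over the anchor's magnitude spectrogram -> speaker embedding."""
    def __init__(self, hidden=128):
        super().__init__()
        self.bgru = nn.GRU(F_BINS, hidden, batch_first=True, bidirectional=True)

    def forward(self, anchor):                      # anchor: (B, samples)
        mag = stft(anchor).abs().transpose(1, 2)    # (B, T, F)
        h, _ = self.bgru(mag)
        return h.mean(dim=1)                        # (B, 2 * hidden)

class SpeakerfilterProSketch(nn.Module):
    def __init__(self, emb=256, hidden=256):
        super().__init__()
        self.enc = SpeakerEncoder(emb // 2)
        # 1-D conv stacks stand in for the WaveUNet front end and back end.
        self.wave_in = nn.Sequential(nn.Conv1d(1, 1, 15, padding=7), nn.Tanh())
        self.wave_out = nn.Conv1d(1, 1, 15, padding=7)
        # A GRU stands in for the CRN; it sees the complex spectrum (real and
        # imaginary parts) concatenated with the broadcast speaker embedding.
        self.rnn = nn.GRU(2 * F_BINS + emb, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, 2 * F_BINS)  # complex (real+imag) mask

    def forward(self, mix, anchor):                 # both: (B, samples)
        e = self.enc(anchor)                        # (B, emb)
        x = self.wave_in(mix.unsqueeze(1)).squeeze(1)
        spec = stft(x)                              # (B, F, T), complex
        feat = torch.cat([spec.real, spec.imag], dim=1).transpose(1, 2)
        feat = torch.cat(
            [feat, e.unsqueeze(1).expand(-1, feat.size(1), -1)], dim=-1)
        h, _ = self.rnn(feat)
        m = self.mask(h).transpose(1, 2)            # (B, 2F, T)
        mr, mi = m.chunk(2, dim=1)                  # complex ratio mask
        est = torch.complex(spec.real * mr - spec.imag * mi,
                            spec.real * mi + spec.imag * mr)
        y = istft(est, length=mix.size(-1))
        return self.wave_out(y.unsqueeze(1)).squeeze(1)

# e.g.: SpeakerfilterProSketch()(torch.randn(2, 16000), torch.randn(2, 16000))
```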
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing (2024-10-15)
  This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
  For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
  Our analysis reveals that speaker verification performance is somewhat unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
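A hedged sketch of the two conditioning vectors compared above; the encoder interface and the closed speaker set are assumptions for illustration:

```python
# Sketch of the two conditioning vectors compared in the paper.
import torch

def one_hot_embedding(speaker_id: int, num_speakers: int) -> torch.Tensor:
    # "Ideal" embedding: the target's identity, usable only when the
    # speaker set is known in advance (closed set).
    e = torch.zeros(num_speakers)
    e[speaker_id] = 1.0
    return e

def enrollment_embedding(encoder, enrollment: torch.Tensor) -> torch.Tensor:
    # Pre-trained-encoder embedding: any model mapping enrollment speech
    # (B, samples) to a fixed-length vector, e.g. a d-vector/x-vector model.
    with torch.no_grad():
        return encoder(enrollment)
```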
- Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement (2024-09-02)
  We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal.
  Our experiments show that a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves the CNN baseline by 3.12 dB.
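The dual-path idea referenced above can be sketched as alternating attention within short chunks and across chunks, giving both local and long-range context at modest cost. This is a generic dual-path block, not the Spectron implementation:

```python
# Generic dual-path block; dimensions are illustrative assumptions.
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):                 # x: (B, chunks, chunk_len, dim)
        B, S, K, D = x.shape
        x = self.intra(x.reshape(B * S, K, D)).reshape(B, S, K, D)
        x = x.transpose(1, 2)             # now attend across chunks
        x = self.inter(x.reshape(B * K, S, D)).reshape(B, K, S, D)
        return x.transpose(1, 2)          # back to (B, chunks, chunk_len, dim)
```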
- ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings (2024-06-05)
  Speaker Diarization (SD) aims at grouping speech segments that belong to the same speaker.
  Beamforming, i.e., spatial filtering, is a common practice to process multi-microphone audio data.
  This paper proposes a self-attention-based algorithm to select the output of a bank of fixed spatial filters.
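An illustrative version of the selection mechanism described above: self-attention over per-beamformer features, then softmax weights over the bank. Feature dimensions and pooling are assumptions, not the ASoBO implementation:

```python
# Attention-based selection over the outputs of K fixed beamformers.
import torch
import torch.nn as nn

class BeamformerSelector(nn.Module):
    def __init__(self, feat_dim=80, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, bf_feats):          # (B, K, feat_dim), one per beam
        h, _ = self.attn(bf_feats, bf_feats, bf_feats)
        return torch.softmax(self.score(h).squeeze(-1), dim=-1)  # (B, K)
```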
- Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation (2020-12-01)
  We propose deep ad-hoc beamforming based on speaker extraction, which is to our knowledge the first work for target-dependent speech separation based on ad-hoc microphone arrays and deep learning.
  Experimental results demonstrate the effectiveness of the proposed method.
- Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals (2020-11-19)
  We propose a speaker extraction technique that performs in multiple stages to take full advantage of a short reference speech sample.
  For the first time, we use frame-level sequential speech embedding as the reference for the target speaker.
- Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation (2020-10-04)
  We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
  Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
  State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
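A minimal sketch of the input representation typically used for complex spectral mapping: the real and imaginary STFT parts of all channels are stacked as network input. The mapping network itself (which regresses the target's real/imaginary spectrogram) is omitted, and shapes are assumptions:

```python
# Stack real/imag STFTs of all C microphones into one feature tensor.
import torch

def stack_complex_channels(waves, n_fft=512, hop=128):
    # waves: (B, C, samples) multi-microphone time-domain signals
    B, C, _ = waves.shape
    win = torch.hann_window(n_fft)
    spec = torch.stft(waves.reshape(B * C, -1), n_fft, hop,
                      window=win, return_complex=True)   # (B*C, F, T)
    spec = spec.reshape(B, C, *spec.shape[1:])
    return torch.cat([spec.real, spec.imag], dim=1)      # (B, 2C, F, T)
```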
- DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in Non-ideal Audio Signals (2020-08-26)
  We propose a deep learning-based technique to deduce the filterbank design from vast amounts of speech audio.
  The purpose of such a filterbank is to extract features robust to non-ideal audio conditions, such as degraded, short duration, and multi-lingual speech.
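The learned-filterbank idea can be sketched as a strided 1-D convolution over raw audio in place of fixed mel filters. The kernel and stride below assume 16 kHz audio (25 ms window, 10 ms hop) and are illustrative, not the DeepVOX design:

```python
# Data-driven filterbank: learned conv kernels replace fixed mel filters.
import torch
import torch.nn as nn

class LearnedFilterbank(nn.Module):
    def __init__(self, n_filters=40, kernel=400, stride=160):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel, stride=stride)

    def forward(self, wave):                       # wave: (B, samples)
        feats = self.conv(wave.unsqueeze(1))       # (B, n_filters, frames)
        return torch.log1p(feats.abs())            # compressed magnitudes
```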
- SpEx: Multi-Scale Time Domain Speaker Extraction Network (2020-04-17)
  Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment.
  It is common to perform the extraction in the frequency domain and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra.
  We propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
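A hedged sketch of the multi-scale time-domain encoding idea: parallel 1-D convolutions with short, middle, and long windows replace the STFT. Window lengths and basis size are illustrative, not the exact SpEx configuration:

```python
# Multi-scale time-domain encoder: parallel convs with different windows.
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    def __init__(self, n_basis=256, windows=(20, 80, 160), hop=10):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Conv1d(1, n_basis, w, stride=hop) for w in windows)

    def forward(self, mix):                        # mix: (B, samples)
        x = mix.unsqueeze(1)
        outs = [torch.relu(enc(x)) for enc in self.encoders]
        T = min(o.size(-1) for o in outs)          # align frame counts
        return torch.cat([o[..., :T] for o in outs], dim=1)  # (B, 3*basis, T)
```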
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam (2020-01-23)
  SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
  SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
  We show experimentally that the proposed strategies greatly improve speech extraction performance, especially for same-gender mixtures.
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation (2020-01-02)
  Target speech separation refers to extracting the target speaker's speech from mixed signals.
  Two main challenges are the complex acoustic environment and the real-time processing requirement.
  We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
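As an example of the spatial cues a direction-informed filter like the one above can consume, here is a common inter-channel phase difference (IPD) feature; its use here is an illustrative assumption, not necessarily the paper's exact feature set:

```python
# Inter-channel phase difference (IPD), a standard spatial cue.
import torch

def ipd_features(spec_ref, spec_other):
    # spec_*: complex STFTs (B, F, T) from a reference and a second microphone
    dphi = torch.angle(spec_other) - torch.angle(spec_ref)
    return torch.cos(dphi), torch.sin(dphi)       # wrap-free sin/cos encoding
```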
This list is automatically generated from the titles and abstracts of the papers on this site.