Single microphone speaker extraction using unified time-frequency Siamese-Unet
- URL: http://arxiv.org/abs/2203.02941v1
- Date: Sun, 6 Mar 2022 11:45:30 GMT
- Title: Single microphone speaker extraction using unified time-frequency Siamese-Unet
- Authors: Aviad Eisenberg, Sharon Gannot and Shlomo E. Chazan
- Abstract summary: We propose a Siamese-Unet architecture that uses both time-domain and frequency-domain representations.
Siamese encoders are applied in the frequency-domain to infer the embedding of the noisy and reference spectra.
The model is trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) loss to exploit the time-domain information.
- Score: 22.224446472612197
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper we present a unified time-frequency method for speaker
extraction in clean and noisy conditions. Given a mixed signal, along with a
reference signal, the common approaches for extracting the desired speaker are
either applied in the time-domain or in the frequency-domain. In our approach,
we propose a Siamese-Unet architecture that uses both representations. The
Siamese encoders are applied in the frequency-domain to infer the embedding of
the noisy and reference spectra, respectively. The concatenated representations
are then fed into the decoder to estimate the real and imaginary components of
the desired speaker, which are then inverse-transformed to the time-domain. The
model is trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
loss to exploit the time-domain information. The time-domain loss is also
regularized with frequency-domain loss to preserve the speech patterns.
Experimental results demonstrate that the unified approach is not only very
easy to train, but also provides superior results as compared with
state-of-the-art (SOTA) Blind Source Separation (BSS) methods, as well as
a commonly used speaker extraction approach.
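To make the pipeline described in the abstract concrete, a minimal sketch follows. It is illustrative only, not the authors' released code: a shared (Siamese) convolutional encoder embeds the real/imaginary spectra of the mixture and the reference, the concatenated embeddings are decoded into the real and imaginary components of the desired speaker, the estimate is inverse-transformed to the time domain, and training combines a negative SI-SDR term with a frequency-domain regularizer. The PyTorch framing, layer sizes, STFT parameters, and the magnitude-based regularizer are all assumptions.

```python
# Illustrative sketch only (not the paper's code). Assumed: PyTorch,
# STFT parameters, layer sizes, and a magnitude-spectrum regularizer.
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128  # assumed STFT parameters


class SiameseUnetSketch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # One encoder applied twice with shared weights (the "Siamese" part).
        self.encoder = nn.Sequential(
            nn.Conv2d(2, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder consumes the concatenated mixture/reference embeddings and
        # outputs two channels: real and imaginary parts of the target spectrum.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(ch, 2, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, mix_spec, ref_spec):
        # mix_spec, ref_spec: (batch, 2, freq, time) with real/imag stacked.
        z = torch.cat([self.encoder(mix_spec), self.encoder(ref_spec)], dim=1)
        out = self.decoder(z)
        # Crop back to the input time-frequency grid (sketch-level shape handling).
        return out[..., : mix_spec.shape[-2], : mix_spec.shape[-1]]


def si_sdr(est, target, eps=1e-8):
    """Scale-Invariant SDR in dB for (batch, samples) waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    alpha = (est * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = alpha * target
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))


def loss_fn(est_spec, target_wave, lam=0.1):
    """Negative SI-SDR (time domain) plus an assumed magnitude-spectrum penalty.

    est_spec is assumed to share the mixture's STFT grid (same N_FFT/HOP and
    the same signal length as target_wave), so the spectra align frame-by-frame.
    """
    win = torch.hann_window(N_FFT, device=target_wave.device)
    est_c = torch.complex(est_spec[:, 0], est_spec[:, 1])  # (batch, freq, frames)
    est_wave = torch.istft(est_c, n_fft=N_FFT, hop_length=HOP, window=win,
                           length=target_wave.shape[-1])
    target_c = torch.stft(target_wave, n_fft=N_FFT, hop_length=HOP, window=win,
                          return_complex=True)
    freq_term = (est_c.abs() - target_c.abs()).pow(2).mean()
    return -si_sdr(est_wave, target_wave).mean() + lam * freq_term
```

A full Siamese-Unet would also carry skip connections from the encoder to the decoder; they are omitted here to keep the sketch short.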
Related papers
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition [12.980843126905203]
We show that global attention over frequencies is beneficial over local convolution.
We obtain a 2.4% relative word error rate reduction on a production-scale model by replacing its convolutional neural network frontend.
arXiv Detail & Related papers (2023-06-12T08:37:36Z)
- Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [1.4277428617774877]
We present Vocos, a new model that directly generates Fourier spectral coefficients.
It substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.
arXiv Detail & Related papers (2023-06-01T15:40:32Z)
- Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose to Adaptively learn Frequency information in a two-branch Detection framework, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Time-Frequency Analysis based Deep Interference Classification for Frequency Hopping System [2.8123846032806035]
Interference classification plays an important role in protecting the authorized communication system.
In this paper, the interference classification problem for the frequency hopping communication system is discussed.
Considering that multiple interferences may be present in the frequency hopping system, a composite time-frequency analysis method based on linear and bilinear transforms is adopted.
arXiv Detail & Related papers (2021-07-21T14:22:40Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- Robust Multi-channel Speech Recognition using Frequency Aligned Network [23.397670239950187]
We use a frequency aligned network for robust automatic speech recognition.
We show that our multi-channel acoustic model with a frequency aligned network achieves up to an 18% relative reduction in word error rate.
arXiv Detail & Related papers (2020-02-06T21:47:39Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)