Sparse Mixture of Local Experts for Efficient Speech Enhancement
- URL: http://arxiv.org/abs/2005.08128v1
- Date: Sat, 16 May 2020 23:23:22 GMT
- Title: Sparse Mixture of Local Experts for Efficient Speech Enhancement
- Authors: Aswin Sivaraman, Minje Kim
- Abstract summary: We investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
By splitting up the speech denoising task into non-overlapping subproblems, we are able to improve denoising performance while also reducing computational complexity.
Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network.
- Score: 19.645016575334786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate a deep learning approach for speech denoising
through an efficient ensemble of specialist neural networks. By splitting up
the speech denoising task into non-overlapping subproblems and introducing a
classifier, we are able to improve denoising performance while also reducing
computational complexity. More specifically, the proposed model incorporates a
gating network which assigns noisy speech signals to an appropriate specialist
network based on either speech degradation level or speaker gender. In our
experiments, a baseline recurrent network is compared against an ensemble of
similarly-designed smaller recurrent networks regulated by the auxiliary gating
network. Using stochastically generated batches from a large noisy speech
corpus, the proposed model learns to estimate a time-frequency masking matrix
based on the magnitude spectrogram of an input mixture signal. Both baseline
and specialist networks are trained to estimate the ideal ratio mask, while the
gating network is trained to perform subproblem classification. Our findings
demonstrate that a fine-tuned ensemble network is able to exceed the speech
denoising capabilities of a generalist network, doing so with fewer model
parameters.
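The pipeline the abstract describes (a gating classifier routes each noisy mixture to one small specialist network, which estimates a time-frequency mask applied to the magnitude spectrogram) can be sketched in a few lines. The sketch below is illustrative only: the shapes, the number of experts, and the plain linear-plus-sigmoid "networks" with random weights are assumptions, not the paper's recurrent architecture or trained models.

```python
# Illustrative sketch of a sparse mixture of local experts for speech
# denoising. Shapes, expert count, and the stand-in linear "networks"
# are assumptions; the paper uses trained recurrent networks.
import numpy as np

rng = np.random.default_rng(0)
F, T = 257, 100          # frequency bins x time frames (assumed)
n_experts = 2            # e.g. one specialist per speaker gender

def ideal_ratio_mask(speech_mag, noise_mag):
    """IRM training target: |S| / (|S| + |N|), elementwise."""
    return speech_mag / (speech_mag + noise_mag + 1e-8)

def specialist(x, W):
    """Stand-in mask-estimating network: sigmoid output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(W @ x)))

def gate(x, Wg):
    """Stand-in gating network: classifies the mixture into one subproblem."""
    logits = Wg @ x.mean(axis=1)     # pool over time, score each expert
    return int(np.argmax(logits))    # sparse routing: pick a single expert

# Toy parameters (random; a real system trains these end to end).
experts = [rng.normal(scale=0.1, size=(F, F)) for _ in range(n_experts)]
Wg = rng.normal(scale=0.1, size=(n_experts, F))

mixture_mag = np.abs(rng.normal(size=(F, T)))   # magnitude spectrogram
k = gate(mixture_mag, Wg)                       # only one expert runs,
mask = specialist(mixture_mag, experts[k])      # so inference stays cheap
enhanced_mag = mask * mixture_mag               # time-frequency masking
```

Because the gate selects a single specialist per input, only one small network runs at inference time, which is the source of the computational savings claimed above.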
Related papers
- Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning [2.3076690318595676]
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
A Federated Learning model can identify the participants in a conversation without the requirement of a large audio database for training.
An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings.
arXiv Detail & Related papers (2024-04-16T18:40:28Z)
- Training neural networks with structured noise improves classification and generalization [0.0]
We show that adding structure to noisy training data can substantially improve algorithm performance.
We also prove that the so-called Hebbian Unlearning rule coincides with the training-with-noise algorithm when noise is maximal.
arXiv Detail & Related papers (2023-02-26T22:10:23Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations by more than half for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Full-Reference Speech Quality Estimation with Attentional Siamese Neural Networks [0.0]
We present a full-reference speech quality prediction model with a deep learning approach.
The model determines a feature representation of the reference and the degraded signal through a siamese recurrent convolutional network.
The resulting features are then used to align the signals with an attention mechanism and are finally combined to estimate the overall speech quality.
arXiv Detail & Related papers (2021-05-03T12:38:25Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- Untangling in Invariant Speech Recognition [17.996356271398295]
We study how information is untangled within neural networks trained to recognize speech.
We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties are untangled in later layers.
We find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation.
arXiv Detail & Related papers (2020-03-03T20:48:43Z)
- Boosted Locality Sensitive Hashing: Discriminative Binary Codes for Source Separation [19.72987718461291]
We propose an adaptive boosting approach to learning locality sensitive hash codes, which represent audio spectra efficiently.
We use the learned hash codes for single-channel speech denoising tasks as an alternative to a complex machine learning model.
arXiv Detail & Related papers (2020-02-14T20:10:00Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation on short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
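Of the related approaches listed above, the SimPF idea of reducing temporal redundancy with simple non-parametric pooling is compact enough to sketch. The frame count, mel-bin count, and pooling factor below are illustrative assumptions, not values from that paper.

```python
# Illustrative sketch of a simple pooling front-end (SimPF-style):
# non-parametric average pooling along time halves the number of
# frames a downstream audio network must process. Shapes are assumed.
import numpy as np

def simple_pooling_frontend(features, pool=2):
    """Average-pool a (frames, mel_bins) feature matrix along time."""
    n_frames = (features.shape[0] // pool) * pool   # drop any remainder
    trimmed = features[:n_frames]
    return trimmed.reshape(-1, pool, features.shape[1]).mean(axis=1)

mel = np.random.default_rng(1).normal(size=(100, 64))  # 100 frames, 64 mels
pooled = simple_pooling_frontend(mel)                  # 50 frames remain
```

Halving the frame count roughly halves the floating point operations of any per-frame downstream network, which matches the efficiency argument in the summary above.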
This list is automatically generated from the titles and abstracts of the papers in this site.