Semi-supervised source localization in reverberant environments with
deep generative modeling
- URL: http://arxiv.org/abs/2101.10636v1
- Date: Tue, 26 Jan 2021 08:54:38 GMT
- Title: Semi-supervised source localization in reverberant environments with
deep generative modeling
- Authors: Michael J. Bianco, Sharon Gannot, Efren Fernandez-Grande, and Peter
Gerstoft
- Abstract summary: A semi-supervised approach to acoustic source localization in reverberant environments is proposed.
The approach is based on deep generative modeling.
We find that VAE-SSL can outperform both SRP-PHAT and fully supervised CNNs.
- Score: 25.085177610870666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A semi-supervised approach to acoustic source localization in reverberant
environments, based on deep generative modeling, is proposed. Localization in
reverberant environments remains an open challenge. Even with large data
volumes, the number of labels available for supervised learning in reverberant
environments is usually small. We address this issue by performing
semi-supervised learning (SSL) with convolutional variational autoencoders
(VAEs) on speech signals in reverberant environments. The VAE is trained to
generate the phase of relative transfer functions (RTFs) between microphones,
in parallel with a direction of arrival (DOA) classifier based on RTF-phase, on
both labeled and unlabeled RTF samples. In learning to perform these tasks, the
VAE-SSL explicitly learns to separate the physical causes of the RTF-phase
(i.e., source location) from distracting signal characteristics such as noise
and speech activity. Relative to existing semi-supervised localization methods
in acoustics, VAE-SSL is effectively an end-to-end processing approach which
relies on minimal preprocessing of RTF-phase features. The VAE-SSL approach is
compared with the steered response power with phase transform (SRP-PHAT) and
fully supervised CNNs. We find that VAE-SSL can outperform both SRP-PHAT and
CNN in label-limited scenarios. Further, the trained VAE-SSL system can
generate new RTF-phase samples, which shows the VAE-SSL approach learns the
physics of the acoustic environment. The generative modeling in VAE-SSL thus
provides a means of interpreting the learned representations.
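The abstract's core feature is the phase of the relative transfer function (RTF) between microphones. As a minimal sketch of what such a feature looks like, the following estimates RTF phase from the ratio of averaged cross- to auto-power spectra of two microphone signals. The function name, FFT parameters, and the PSD-ratio estimator are illustrative assumptions, not the paper's exact preprocessing pipeline.

```python
import numpy as np

def rtf_phase(x_ref, x_mic, n_fft=512, hop=256):
    """Estimate RTF phase between a reference and a second microphone.

    The RTF is estimated as the ratio of cross- to auto-power spectral
    densities averaged over STFT frames; only its phase is returned,
    mirroring the RTF-phase features described in the abstract.
    """
    n_frames = (len(x_ref) - n_fft) // hop + 1
    win = np.hanning(n_fft)
    cross = np.zeros(n_fft // 2 + 1, dtype=complex)
    auto = np.zeros(n_fft // 2 + 1)
    for t in range(n_frames):
        seg_ref = np.fft.rfft(win * x_ref[t * hop: t * hop + n_fft])
        seg_mic = np.fft.rfft(win * x_mic[t * hop: t * hop + n_fft])
        cross += seg_mic * np.conj(seg_ref)   # accumulate cross-PSD
        auto += np.abs(seg_ref) ** 2          # accumulate reference auto-PSD
    rtf = cross / np.maximum(auto, 1e-12)     # RTF estimate per frequency bin
    return np.angle(rtf)                      # phase in radians

# Toy check: a pure inter-microphone delay yields a linear phase ramp,
# phi[k] ~ -2*pi*k*d/n_fft for a delay of d samples.
delay = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
x_delayed = np.concatenate([np.zeros(delay), x[:-delay]])
phi = rtf_phase(x, x_delayed)
```

In the paper's setting, these per-bin phase values (stacked over frequency) would form the input the VAE learns to generate and the DOA classifier consumes; source location governs the phase ramp, while noise and speech activity perturb it.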
Related papers
- R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [83.77114091471822]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML).
A challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming.
This is particularly pronounced for word embedding parameters in large language models (LLMs), which are crucial for language understanding.
A physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z)
- Learning Cautiously in Federated Learning with Noisy and Heterogeneous Clients [4.782145666637457]
Federated learning (FL) is a distributed framework for collaboratively training with privacy guarantees.
In real-world scenarios, clients may have non-IID data (local class imbalance) with poor annotation quality (label noise).
We propose FedCNI without using an additional clean proxy dataset.
It includes a noise-resilient local solver and a robust global aggregator.
arXiv Detail & Related papers (2023-04-06T06:47:14Z)
- Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to infer the latent true labels efficiently.
Our approach safeguards the stable update of the noise transition, avoiding the arbitrary tuning from a mini-batch of samples used in previous methods.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning [13.391307807956673]
We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification to adapt to the English pronunciation of English-as-a-second-language (ESL) learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
arXiv Detail & Related papers (2022-04-08T06:13:55Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, i.e., it produces high-quality results in a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to approach the departure-from-normality function of real samples, computed in the spectral domain of the Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- Semi-supervised source localization with deep generative modeling [27.344649091365067]
We propose a semi-supervised localization approach based on deep generative modeling with variational autoencoders (VAEs).
VAE-SSL can outperform both SRP-PHAT and CNN in label-limited scenarios.
arXiv Detail & Related papers (2020-05-27T04:59:52Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.