Parameterized Channel Normalization for Far-field Deep Speaker
Verification
- URL: http://arxiv.org/abs/2109.12056v1
- Date: Fri, 24 Sep 2021 16:22:31 GMT
- Title: Parameterized Channel Normalization for Far-field Deep Speaker
Verification
- Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen
- Abstract summary: We focus on two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN).
We evaluate the performance on Hi-MIA, a recent large-scale far-field speech corpus, with varied microphone and positional settings.
Our methods outperform conventional mel filterbank features, with a maximum of 33.5% and 39.5% relative improvement in equal error rate under matched and mismatched microphone conditions, respectively.
- Score: 21.237143465298505
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address far-field speaker verification with a deep neural network (DNN)
based speaker embedding extractor, where mismatch between enrollment and test
data often comes from convolutive effects (e.g. room reverberation) and noise.
To mitigate these effects, we focus on two parametric normalization methods:
per-channel energy normalization (PCEN) and parameterized cepstral mean
normalization (PCMN). Both methods contain differentiable parameters and thus
can be conveniently integrated into, and jointly optimized with, the DNN using
automatic differentiation methods. We consider both fixed and trainable
(data-driven) variants of each method. We evaluate the performance on Hi-MIA, a
recent large-scale far-field speech corpus, with varied microphone and
positional settings. Our methods outperform conventional mel filterbank
features, with a maximum of 33.5% and 39.5% relative improvement on equal error
rate under matched microphone and mismatched microphone conditions,
respectively.
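As a rough illustration of the first method named in the abstract, the sketch below implements fixed-parameter PCEN in NumPy, following the commonly cited formulation: a first-order smoother `M` tracks the energy in each channel, and the output is `(E / (eps + M)**alpha + delta)**r - delta**r`. The formula itself is not given on this page, so the function name `pcen`, the default values, and the choice to initialise the smoother with the first frame are all assumptions made for illustration. They are not the authors' configuration. The trainable variant described in the abstract would learn these parameters jointly with the DNN through automatic differentiation, which is not shown here.

```python
import numpy as np


def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Fixed-parameter PCEN sketch (hypothetical helper, illustrative defaults).

    E: non-negative filterbank energies with shape (frames, channels).
    """
    E = np.asarray(E, dtype=np.float64)
    M = np.empty_like(E)
    M[0] = E[0]  # assumption: start the smoother at the first frame
    for t in range(1, E.shape[0]):
        # First-order IIR smoothing along time, independently per channel.
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # Adaptive gain control (division by M**alpha), then root compression.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r


# Usage on random stand-in "mel energies" (200 frames, 40 channels).
rng = np.random.default_rng(0)
E = rng.uniform(0.1, 1.0, size=(200, 40))
print(pcen(E).shape)  # (200, 40)
```

With `alpha` close to 1, a constant gain applied to a channel largely cancels between `E` and `M`, which is one reason this kind of normalization is considered for microphone and channel mismatch.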
Related papers
- Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior [23.448790295875828]
Style Transfer with Inference-Time optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. We introduce a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation.
arXiv Detail & Related papers (2025-05-16T14:40:31Z)
- Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features [10.480691005356967]
We propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR) and clarity (C50) across 10 frequency bands.
The proposed framework utilizes a novel feature named Spectro-Spatial Covariance Vector (SSCV), efficiently representing temporal, spectral, and spatial information of the FOA signal.
arXiv Detail & Related papers (2024-11-05T15:20:23Z)
- Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching [77.133400999703]
Correlation based stereo matching has achieved outstanding performance.
Current methods with a fixed model do not work uniformly well across various datasets.
This paper proposes a new perspective to dynamically calculate correlation for robust stereo matching.
arXiv Detail & Related papers (2023-07-26T09:47:37Z)
- M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at Three Levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z)
- Improved far-field speech recognition using Joint Variational Autoencoder [5.320201231911981]
We propose mapping speech features from far-field to close-talk using a denoising autoencoder (DA).
Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to AM trained directly on far-field filterbank features.
arXiv Detail & Related papers (2022-04-24T14:14:04Z)
- AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Fusion of Range and Stereo Data for High-Resolution Scene-Modeling [20.824550995195057]
This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps.
We combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation.
The accuracy of the method is not compromised, owing to three properties of the data-term in the energy function.
arXiv Detail & Related papers (2020-12-12T09:37:42Z)
- AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching [50.06646151004375]
A novel domain-adaptive pipeline called AdaStereo aims to align multi-level representations for deep stereo matching networks.
Our AdaStereo models achieve state-of-the-art cross-domain performance on multiple stereo benchmarks, including KITTI, Middlebury, ETH3D, and DrivingStereo.
arXiv Detail & Related papers (2020-04-09T16:15:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.