Multi-stream Convolutional Neural Network with Frequency Selection for
Robust Speaker Verification
- URL: http://arxiv.org/abs/2012.11159v2
- Date: Tue, 12 Jan 2021 11:29:37 GMT
- Authors: Wei Yao, Shen Chen, Jiamin Cui, Yaolin Lou
- Abstract summary: We propose a novel multi-stream Convolutional Neural Network (CNN) framework for speaker verification tasks.
The proposed framework accommodates diverse temporal embeddings generated from multiple streams to enhance the robustness of acoustic modeling.
We conduct extensive experiments on the VoxCeleb dataset, and the experimental results demonstrate that the multi-stream CNN significantly outperforms the single-stream baseline.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker verification aims to verify whether an input utterance
corresponds to the claimed speaker. Conventionally, such systems are deployed
in a single-stream scenario, wherein the feature extractor operates over the
full frequency range. In this paper, we hypothesize that a machine can learn
enough knowledge to perform the classification task when listening to a
partial frequency range instead of the full range, a technique we call
frequency selection, and we further propose a novel multi-stream
Convolutional Neural Network (CNN) framework built on this technique for
speaker verification. The proposed framework accommodates diverse temporal
embeddings generated from multiple streams to enhance the robustness of
acoustic modeling. To diversify the temporal embeddings, we employ feature
augmentation with frequency selection: the full frequency band is manually
segmented into several sub-bands, and the feature extractor of each stream
selects which sub-bands to use as its target frequency domain. Unlike the
conventional single-stream solution, in which each utterance is processed
only once, in this framework multiple streams process it in parallel. The
input utterance for each stream is pre-processed by a frequency selector
within a specified frequency range and post-processed by mean normalization.
The normalized temporal embeddings of each stream then flow into a pooling
layer to generate fused embeddings. We conduct extensive experiments on the
VoxCeleb dataset, and the results demonstrate that the multi-stream CNN
significantly outperforms the single-stream baseline, with a 20.53% relative
improvement in minimum Decision Cost Function (minDCF).
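The abstract describes a concrete pipeline: per-stream frequency selection on the input spectrogram, a CNN producing temporal embeddings, mean normalization, and a pooling layer that fuses the streams. Below is a minimal PyTorch sketch of that pipeline; the module names, CNN depth, sub-band boundaries, and embedding size are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """One stream: frequency selector -> CNN -> temporal embeddings -> mean norm.

    `band` is a (low_bin, high_bin) slice of the frequency axis; the actual
    sub-band boundaries in the paper are chosen manually and may differ.
    """
    def __init__(self, band, embed_dim=128):
        super().__init__()
        self.band = band
        self.cnn = nn.Sequential(            # stand-in for the paper's CNN
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, spec):                           # spec: (batch, 1, freq, time)
        x = spec[:, :, self.band[0]:self.band[1], :]   # frequency selector
        x = self.cnn(x).mean(dim=2)                    # pool freq axis -> (batch, 64, time)
        emb = self.proj(x.transpose(1, 2))             # temporal embeddings (batch, time, dim)
        return emb - emb.mean(dim=1, keepdim=True)     # mean normalization

class MultiStreamCNN(nn.Module):
    """Runs sub-band streams in parallel and fuses their normalized embeddings."""
    def __init__(self, bands, embed_dim=128):
        super().__init__()
        self.streams = nn.ModuleList(StreamEncoder(b, embed_dim) for b in bands)

    def forward(self, spec):
        # Stack per-stream temporal embeddings, then let a pooling layer
        # (simple averaging here) produce the fused utterance embedding.
        stacked = torch.stack([s(spec) for s in self.streams], dim=1)  # (B, S, T, D)
        return stacked.mean(dim=(1, 2))                                # (B, D)

# Hypothetical usage: an 80-bin spectrogram split into full band plus two halves.
bands = [(0, 80), (0, 40), (40, 80)]
model = MultiStreamCNN(bands)
fused = model(torch.randn(4, 1, 80, 200))  # -> torch.Size([4, 128])
```

A cosine similarity between the fused embeddings of an enrollment and a test utterance would then serve as the verification score, which is how such embeddings are typically compared.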
Related papers
- Frequency-Aware Deepfake Detection: Improving Generalizability through
Frequency Space Learning [81.98675881423131]
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images.
Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries.
We introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors.
arXiv Detail & Related papers (2024-03-12T01:28:00Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms to the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals [7.381259294661687]
We propose a frequency-aware masked autoencoder that learns to parameterize the representation of biosignals in the frequency space.
The resulting architecture effectively utilizes multimodal information during pretraining, and can be seamlessly adapted to diverse tasks and modalities at test time.
arXiv Detail & Related papers (2023-09-12T02:59:26Z)
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose to adaptively learn frequency information in a two-branch detection framework, dubbed AFD.
We liberate the network from fixed frequency transforms and achieve better performance with data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method [67.24600975813419]
We propose a convolution layer capable of handling arbitrary sampling frequencies by a single deep neural network.
We show that the introduction of the proposed layer enables a conventional audio source separation model to consistently work with even unseen sampling frequencies.
arXiv Detail & Related papers (2021-05-10T02:33:42Z)
- Frequency Gating: Improved Convolutional Neural Networks for Speech Enhancement in the Time-Frequency Domain [37.722450363816144]
We introduce a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN (see the sketch after this list).
Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline.
A loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
arXiv Detail & Related papers (2020-11-08T22:04:00Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
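The Frequency Gating entry above is only a one-line description; the sketch below shows one plausible reading of it, a learned per-frequency multiplicative gate applied to a convolution's output. The class name, shapes, and gate placement are assumptions for illustration, not the cited paper's exact formulation.

```python
import torch
import torch.nn as nn

class FrequencyGatedConv(nn.Module):
    """Convolution whose output is scaled by learned frequency-wise gates
    (a hypothetical reading of 'multiplicative weights for CNN kernels')."""
    def __init__(self, in_ch, out_ch, n_freq):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # One gate per output channel and frequency bin, shared across time.
        self.gate = nn.Parameter(torch.zeros(1, out_ch, n_freq, 1))

    def forward(self, x):                    # x: (batch, ch, freq, time)
        return self.conv(x) * torch.sigmoid(self.gate)

y = FrequencyGatedConv(1, 16, n_freq=80)(torch.randn(2, 1, 80, 100))
print(y.shape)  # torch.Size([2, 16, 80, 100])
```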