Speaker Recognition using SincNet and X-Vector Fusion
- URL: http://arxiv.org/abs/2004.02219v1
- Date: Sun, 5 Apr 2020 14:44:14 GMT
- Title: Speaker Recognition using SincNet and X-Vector Fusion
- Authors: Mayank Tripathi, Divyanshu Singh, Seba Susan
- Abstract summary: We propose an innovative approach to perform speaker recognition by fusing two recently introduced deep neural networks (DNNs), namely SincNet and X-Vector.
- Score: 8.637110868126546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an innovative approach to perform speaker
recognition by fusing two recently introduced deep neural networks (DNNs),
namely SincNet and X-Vector. The idea behind using SincNet filters on the raw
speech waveform is to extract more distinguishing frequency-related features in
the initial convolution layers of the CNN architecture. X-Vectors are used to
take advantage of the fact that this embedding is an efficient method for
producing fixed-dimension features from variable-length speech utterances,
something that is challenging for plain CNN techniques, making it efficient in
terms of both speed and accuracy. Our approach uses the best of both worlds by
combining X-Vectors in the later layers while using SincNet filters in the
initial layers of our deep model. This approach allows the network to learn
better embeddings and converge more quickly. Previous works use either
X-Vectors or SincNet filters, or some modification of them; here we introduce a
novel fusion architecture that combines both techniques to gather more
information about the speech signal, giving us better results. Our method
focuses on the VoxCeleb1 dataset for speaker recognition, which we use for both
training and testing.
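The two ingredients the abstract describes can be sketched in a few lines of NumPy: a SincNet-style band-pass filter (the difference of two windowed sinc low-pass filters, where only the two cutoff frequencies would be learned) applied to a raw waveform, followed by X-Vector-style statistics pooling to obtain a fixed-dimension embedding from a variable-length utterance. This is a minimal illustration, not the authors' implementation; the filter length, band edges, and filter-bank size below are arbitrary choices for the example.

```python
import numpy as np

def sinc_bandpass(f1, f2, kernel_len, sr):
    """Band-pass kernel as the difference of two windowed sinc low-pass
    filters -- the core of SincNet, where only f1 and f2 are learnable."""
    t = (np.arange(kernel_len) - (kernel_len - 1) / 2) / sr
    lowpass = lambda fc: 2 * fc * np.sinc(2 * fc * t)   # ideal low-pass impulse response
    h = (lowpass(f2) - lowpass(f1)) * np.hamming(kernel_len)
    return h / np.max(np.abs(h))

def stats_pooling(frames):
    """X-Vector-style statistics pooling: mean and std over time yield a
    fixed-dimension vector regardless of utterance length."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

sr = 16000
wave = np.random.randn(sr)                              # 1 s of stand-in "speech"
bank = np.stack([sinc_bandpass(f, f + 300.0, 101, sr)   # toy 8-filter bank
                 for f in np.linspace(50.0, 3000.0, 8)])
feats = np.stack([np.convolve(wave, h, mode="valid") for h in bank], axis=1)
emb = stats_pooling(feats)                              # fixed size: 2 * 8 = 16
```

In the actual model the filter-bank outputs would feed further convolutional layers before pooling; here the pooling is applied directly to keep the sketch short.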
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Adaptive Convolutional Dictionary Network for CT Metal Artifact Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z)
- StereoSpike: Depth Learning with a Spiking Neural Network [0.0]
We present an end-to-end neuromorphic approach to depth estimation.
We use a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, which we name StereoSpike.
We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts.
arXiv Detail & Related papers (2021-09-28T14:11:36Z)
- Graph Neural Networks with Adaptive Frequency Response Filter [55.626174910206046]
We develop a graph neural network framework AdaGNN with a well-smooth adaptive frequency response filter.
We empirically validate the effectiveness of the proposed framework on various benchmark datasets.
arXiv Detail & Related papers (2021-04-26T19:31:21Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT (heuristic error assignment training) can achieve better accuracy than PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks [78.65792427542672]
Dynamic Graph Network (DG-Net) is a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths.
Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability.
arXiv Detail & Related papers (2020-10-02T16:50:26Z)
- Compiling ONNX Neural Network Models Using MLIR [51.903932262028235]
We present a preliminary report on our onnx-mlir compiler, which generates code for the inference of deep neural network models.
Onnx-mlir relies on the Multi-Level Intermediate Representation (MLIR) infrastructure recently integrated in the LLVM project.
arXiv Detail & Related papers (2020-08-19T05:28:08Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Improved RawNet with Feature Map Scaling for Text-independent Speaker
Verification using Raw Waveforms [44.192033435682944]
We improve RawNet by scaling feature maps using various methods.
The best performing system reduces the equal error rate by half compared to the original RawNet.
arXiv Detail & Related papers (2020-04-01T15:51:56Z) - FastWave: Accelerating Autoregressive Convolutional Neural Networks on
FPGA [27.50143717931293]
WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution.
We develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks.
arXiv Detail & Related papers (2020-02-09T06:15:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.