Speaker Recognition using SincNet and X-Vector Fusion
- URL: http://arxiv.org/abs/2004.02219v1
- Date: Sun, 5 Apr 2020 14:44:14 GMT
- Title: Speaker Recognition using SincNet and X-Vector Fusion
- Authors: Mayank Tripathi, Divyanshu Singh, Seba Susan
- Abstract summary: We propose an innovative approach to perform speaker recognition by fusing two recently introduced deep neural networks (DNNs), namely SincNet and X-Vector.
- Score: 8.637110868126546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an innovative approach to perform speaker
recognition by fusing two recently introduced deep neural networks (DNNs),
namely SincNet and X-Vector. The idea behind using SincNet filters on the raw
speech waveform is to extract more distinguishing frequency-related features in
the initial convolution layers of the CNN architecture. X-Vectors are used to
take advantage of the fact that this embedding is an efficient method for
producing fixed-dimension features from variable-length speech utterances,
something that is challenging for plain CNN techniques, making it efficient in
terms of both speed and accuracy. Our approach uses the best of both worlds by
combining X-Vectors in the later layers while using SincNet filters in the
initial layers of our deep model. This approach allows the network to learn
better embeddings and converge more quickly. Previous works use either
X-Vectors or SincNet filters, or some modification of them; here we introduce a
novel fusion architecture that combines both techniques to gather more
information about the speech signal, giving us better results. Our method
focuses on the VoxCeleb1 dataset for speaker recognition, which we use for both
training and testing.
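The two ingredients the abstract describes can be sketched in a few lines of NumPy: a SincNet-style band-pass filter (the difference of two windowed sinc low-pass filters, where only the two cutoff frequencies would be learned) applied to a raw waveform, followed by X-Vector-style statistics pooling to obtain a fixed-dimension embedding from a variable-length utterance. This is a minimal illustration, not the authors' implementation; the filter length, band edges, and filter-bank size below are arbitrary choices for the example.

```python
import numpy as np

def sinc_bandpass(f1, f2, kernel_len, sr):
    """Band-pass kernel as the difference of two windowed sinc low-pass
    filters -- the core of SincNet, where only f1 and f2 are learnable."""
    t = (np.arange(kernel_len) - (kernel_len - 1) / 2) / sr
    lowpass = lambda fc: 2 * fc * np.sinc(2 * fc * t)   # ideal low-pass impulse response
    h = (lowpass(f2) - lowpass(f1)) * np.hamming(kernel_len)
    return h / np.max(np.abs(h))

def stats_pooling(frames):
    """X-Vector-style statistics pooling: mean and std over time yield a
    fixed-dimension vector regardless of utterance length."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

sr = 16000
wave = np.random.randn(sr)                              # 1 s of stand-in "speech"
bank = np.stack([sinc_bandpass(f, f + 300.0, 101, sr)   # toy 8-filter bank
                 for f in np.linspace(50.0, 3000.0, 8)])
feats = np.stack([np.convolve(wave, h, mode="valid") for h in bank], axis=1)
emb = stats_pooling(feats)                              # fixed size: 2 * 8 = 16
```

In the actual model the filter-bank outputs would feed further convolutional layers before pooling; here the pooling is applied directly to keep the sketch short.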
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Adaptive Convolutional Dictionary Network for CT Metal Artifact Reduction [62.691996239590125]
We propose an adaptive convolutional dictionary network (ACDNet) for metal artifact reduction.
Our ACDNet can automatically learn the prior for artifact-free CT images via training data and adaptively adjust the representation kernels for each input CT image.
Our method inherits the clear interpretability of model-based methods and maintains the powerful representation ability of learning-based methods.
arXiv Detail & Related papers (2022-05-16T06:49:36Z)
- StereoSpike: Depth Learning with a Spiking Neural Network [0.0]
We present an end-to-end neuromorphic approach to depth estimation.
We use a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, which we name StereoSpike.
We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts.
arXiv Detail & Related papers (2021-09-28T14:11:36Z)
- Graph Neural Networks with Adaptive Frequency Response Filter [55.626174910206046]
We develop a graph neural network framework AdaGNN with a well-smooth adaptive frequency response filter.
We empirically validate the effectiveness of the proposed framework on various benchmark datasets.
arXiv Detail & Related papers (2021-04-26T19:31:21Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT (heuristic error assignment training) can achieve better accuracy than PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks [78.65792427542672]
Dynamic Graph Network (DG-Net) is a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths.
Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability.
arXiv Detail & Related papers (2020-10-02T16:50:26Z)
- Compiling ONNX Neural Network Models Using MLIR [51.903932262028235]
We present a preliminary report on our onnx-mlir compiler, which generates code for the inference of deep neural network models.
Onnx-mlir relies on the Multi-Level Intermediate Representation (MLIR) infrastructure recently integrated in the LLVM project.
arXiv Detail & Related papers (2020-08-19T05:28:08Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Improved RawNet with Feature Map Scaling for Text-independent Speaker
Verification using Raw Waveforms [44.192033435682944]
We improve RawNet by scaling feature maps using various methods.
The best performing system reduces the equal error rate by half compared to the original RawNet.
arXiv Detail & Related papers (2020-04-01T15:51:56Z) - FastWave: Accelerating Autoregressive Convolutional Neural Networks on
FPGA [27.50143717931293]
WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution.
We develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks.
arXiv Detail & Related papers (2020-02-09T06:15:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.