CS-Rep: Making Speaker Verification Networks Embracing
Re-parameterization
- URL: http://arxiv.org/abs/2110.13465v1
- Date: Tue, 26 Oct 2021 08:00:03 GMT
- Title: CS-Rep: Making Speaker Verification Networks Embracing
Re-parameterization
- Authors: Ruiteng Zhang, Jianguo Wei, Wenhuan Lu, Lin Zhang, Yantao Ji, Junhai
Xu, Xugang Lu
- Abstract summary: This study proposes cross-sequential re-parameterization (CS-Rep) to increase the inference speed and verification accuracy of models.
Rep-TDNN increases the actual inference speed by about 50% and reduces the EER by 10%.
- Score: 27.38202134344989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speaker verification (ASV) systems, which determine whether two
speech utterances are from the same speaker, mainly focus on verification accuracy while
ignoring inference speed. However, in real applications, both inference speed
and verification accuracy are essential. This study proposes cross-sequential
re-parameterization (CS-Rep), a novel topology re-parameterization strategy for
multi-type networks, to increase the inference speed and verification accuracy
of models. CS-Rep solves the problem that existing re-parameterization methods
are unsuitable for typical ASV backbones. When a model applies CS-Rep, the
training-period network utilizes a multi-branch topology to capture speaker
information, whereas the inference-period model converts to a time-delay neural
network (TDNN)-like plain backbone with stacked TDNN layers to achieve fast
inference. Based on CS-Rep, an improved TDNN that is friendly to testing and
deployment, called Rep-TDNN, is proposed. Compared with the state-of-the-art
model ECAPA-TDNN, which is highly recognized in the industry, Rep-TDNN
increases the actual inference speed by about 50% and reduces the EER by 10%.
The code will be released.
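The core mechanism, fusing a multi-branch training topology into a single plain convolution for inference, can be illustrated with a minimal PyTorch sketch in the RepVGG style. This is a hedged illustration, not the authors' exact CS-Rep procedure (which additionally handles the cross-sequential structures and BatchNorm folding of typical ASV backbones); the class RepBranchBlock and all shapes are hypothetical.

```python
import torch
import torch.nn as nn

class RepBranchBlock(nn.Module):
    """Training-time block: 3-tap conv + 1x1 conv + identity, summed."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x

    def fuse(self):
        """Re-parameterize the three branches into one 3-tap Conv1d."""
        c = self.conv3.in_channels
        w = self.conv3.weight.data.clone()      # (C, C, 3)
        b = self.conv3.bias.data.clone()
        # The 1x1 branch contributes only to the centre tap.
        w[:, :, 1] += self.conv1.weight.data[:, :, 0]
        b += self.conv1.bias.data
        # The identity branch: 1.0 at the centre tap of each channel's own filter.
        for i in range(c):
            w[i, i, 1] += 1.0
        fused = nn.Conv1d(c, c, kernel_size=3, padding=1)
        fused.weight.data.copy_(w)
        fused.bias.data.copy_(b)
        return fused

# Sanity check: the fused conv is numerically equivalent to the branches.
x = torch.randn(2, 8, 50)                       # (batch, channels, frames)
block = RepBranchBlock(8).eval()
print(torch.allclose(block(x), block.fuse()(x), atol=1e-5))  # True
```

Because the 1x1 kernel and the identity are embedded at the centre tap of the 3-tap kernel, the fused convolution is numerically equivalent to the trained multi-branch block while running as a single layer, which is where the inference speedup comes from.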
Related papers
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks (a vector-quantization sketch appears after this list).
arXiv Detail & Related papers (2022-08-03T02:45:52Z) - TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than 50 ms word-timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Deep Time Delay Neural Network for Speech Enhancement with Full Data
Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement (a minimal TDNN-layer sketch appears after this list).
arXiv Detail & Related papers (2020-11-11T06:32:37Z) - Alignment Restricted Streaming Recurrent Neural Network Transducer [29.218353627837214]
We propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T models.
The Ar-RNN-T loss provides refined control over the trade-off between token emission delays and the word error rate (WER).
The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency.
arXiv Detail & Related papers (2020-11-05T19:38:54Z) - DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and testing conditions due to noise and other factors.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features (a toy rescoring sketch appears after this list).
arXiv Detail & Related papers (2020-11-02T13:50:59Z) - Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems (a DARTS-style mixed-operation sketch follows this list).
arXiv Detail & Related papers (2020-07-17T08:32:11Z) - SRDCNN: Strongly Regularized Deep Convolution Neural Network
Architecture for Time-series Sensor Signal Classification Tasks [4.950427992960756]
We present SRDCNN: Strongly Regularized Deep Convolution Neural Network (DCNN) based deep architecture to perform time series classification tasks.
The novelty of the proposed approach is that the network weights are regularized by both L1 and L2 norm penalties (see the regularization sketch after this list).
arXiv Detail & Related papers (2020-07-14T08:42:39Z) - Progressive Tandem Learning for Pattern Recognition with Deep Spiking
Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z) - Exploring Pre-training with Alignments for RNN Transducer based
End-to-End Speech Recognition [39.497407288772386]
The recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research.
In this work, we leverage external alignments to seed the RNN-T model.
Two different pre-training solutions are explored, referred to as encoder pre-training, and whole-network pre-training respectively.
arXiv Detail & Related papers (2020-05-01T19:00:57Z)
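For the VQ-T entry above, the core mechanism is mapping continuous prediction-network states onto a small discrete codebook so that hypotheses with identical code sequences can be merged. A minimal sketch under assumed shapes; vector_quantize, the codebook size, and the omitted auxiliary codebook losses are all illustrative, not the paper's implementation.

```python
import torch

def vector_quantize(h, codebook):
    """Map each hidden state to its nearest codebook vector (hypothetical API).

    h:        (batch, dim) prediction-network states, e.g. LSTM outputs
    codebook: (K, dim) learnable code vectors
    Returns quantized states (straight-through gradient) and code indices.
    """
    d = torch.cdist(h, codebook)          # (batch, K) pairwise distances
    idx = d.argmin(dim=1)                 # nearest code per state
    q = codebook[idx]                     # (batch, dim) quantized states
    # Straight-through estimator: forward uses q, gradient flows back to h.
    # (The codebook itself is trained with auxiliary losses, omitted here.)
    return h + (q - h).detach(), idx

codebook = torch.randn(16, 8)             # K=16 codes of dimension 8
h = torch.randn(4, 8, requires_grad=True)
q, idx = vector_quantize(h, codebook)
print(q.shape, idx)  # hypotheses whose states share an index can be merged
```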
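Several entries above (and the plain backbone Rep-TDNN converts to) build on TDNN layers, which are commonly implemented as dilated 1-D convolutions over the frame axis. A minimal sketch with hypothetical dimensions; TDNNLayer and the context/dilation choices are illustrative, not taken from any of the papers.

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """A TDNN layer as a dilated 1-D convolution over frame context.

    A frame context of {-2, 0, +2} corresponds to kernel_size=3, dilation=2.
    """
    def __init__(self, in_dim, out_dim, context=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context,
                              dilation=dilation,
                              padding=(context - 1) // 2 * dilation)
        self.act = nn.ReLU()

    def forward(self, x):                   # x: (batch, feat_dim, frames)
        return self.act(self.conv(x))

# A plain stacked-TDNN backbone of the kind a re-parameterized model runs at inference.
backbone = nn.Sequential(
    TDNNLayer(40, 256, context=5, dilation=1),
    TDNNLayer(256, 256, context=3, dilation=2),
    TDNNLayer(256, 256, context=3, dilation=3),
)
feats = torch.randn(2, 40, 200)             # (batch, 40 features, 200 frames)
print(backbone(feats).shape)                # torch.Size([2, 256, 200])
```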
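For the N-best rescoring entry, the generic recipe is to combine each hypothesis's ASR score with a score from a second model and re-rank. A toy sketch; rescore_nbest, the interpolation weight, and the stand-in scorer are assumptions, not the paper's DNN-based semantic model.

```python
def rescore_nbest(nbest, semantic_score, lam=0.5):
    """Pick the hypothesis with the best combined ASR + semantic score.

    nbest:          list of (hypothesis_text, asr_log_score) pairs
    semantic_score: callable returning a semantic log-score for a hypothesis
    lam:            interpolation weight (placeholder value)
    """
    return max(nbest, key=lambda h: h[1] + lam * semantic_score(h[0]))

# Toy usage with a hypothetical stand-in scorer.
nbest = [("i red a book", -4.1), ("i read a book", -4.3)]
print(rescore_nbest(nbest, lambda s: 0.0 if "read" in s else -2.0))
```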
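For the NAS entry, DARTS relaxes the discrete choice among candidate operations into a softmax-weighted mixture whose weights are learned by gradient descent. A minimal sketch; the candidate-op set and MixedOp are illustrative and unrelated to the paper's factored-TDNN search space.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style mixed operation: a softmax-weighted sum of candidate ops.

    The architecture parameters `alpha` are learned jointly with the weights;
    after search, the op with the largest alpha is kept and the rest pruned.
    """
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, 3, padding=1),            # context 3
            nn.Conv1d(channels, channels, 3, padding=2, dilation=2),# wider context
            nn.Identity(),                                          # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

x = torch.randn(2, 64, 100)
print(MixedOp(64)(x).shape)   # torch.Size([2, 64, 100])
```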
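For SRDCNN, regularizing the network weights with both L1 and L2 penalties (elastic-net style) amounts to adding both norms to the task loss. A minimal sketch; regularized_loss and the coefficients are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

def regularized_loss(model, logits, targets, l1=1e-5, l2=1e-4):
    """Task loss plus both L1 and L2 penalties on all network weights."""
    loss = nn.functional.cross_entropy(logits, targets)
    for p in model.parameters():
        loss = loss + l1 * p.abs().sum() + l2 * p.pow(2).sum()
    return loss

model = nn.Linear(20, 5)                     # stand-in for the DCNN
x, y = torch.randn(8, 20), torch.randint(0, 5, (8,))
loss = regularized_loss(model, model(x), y)
loss.backward()                              # gradients include both penalties
print(float(loss))
```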
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.