TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding
- URL: http://arxiv.org/abs/2203.09098v1
- Date: Thu, 17 Mar 2022 05:49:35 GMT
- Title: TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding
- Authors: Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin, Junhai Xu,
Lin Zhang, Yantao Ji, Jianwu Dang
- Abstract summary: Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
- Score: 60.292702363839716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker embedding is an important front-end module to explore discriminative
speaker features for many speech applications where speaker information is
needed. Current SOTA backbone networks for speaker embedding are designed to
aggregate multi-scale features from an utterance with multi-branch network
architectures for speaker representation. However, naively adding many branches
of multi-scale features with plain fully convolutional operations does not
improve performance efficiently, because model parameters and computational
complexity grow rapidly. As a result, most current state-of-the-art network
architectures can afford only a few branches, covering a limited number of
temporal scales, for speaker embedding. To
address this problem, in this paper, we propose an effective temporal
multi-scale (TMS) model where multi-scale branches could be efficiently
designed in a speaker embedding network almost without increasing computational
costs. The new model is based on the conventional TDNN, where the network
architecture is smartly separated into two modeling operators: a
channel-modeling operator and a temporal multi-branch modeling operator. Adding
temporal scales in the temporal multi-branch operator increases the number of
parameters only slightly, which frees computational budget for more branches
with large temporal scales. Moreover, for the inference stage, we developed a
systematic re-parameterization method that converts the TMS-based model into a
single-path topology to increase inference speed. We investigated the
performance of the new TMS method for automatic speaker verification (ASV)
under in-domain and out-of-domain conditions. Results show that the TMS-based
model significantly outperformed the SOTA ASV models while also achieving
faster inference.
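The two ideas in the abstract, splitting a TDNN layer into a channel operator plus cheap temporal branches and then folding the branches into a single path for inference, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the temporal branches are depthwise 1D convolutions with odd kernel sizes, "same" zero padding, no dilation, and no normalization layers, and that branch outputs are summed. All function names and sizes (depthwise_conv1d, merge_branches, the 512-channel example) are hypothetical choices made for illustration.

```python
import numpy as np


def depthwise_conv1d(x, w):
    """Depthwise 'same' 1D convolution (cross-correlation form).

    x: (C, T) input features; w: (C, K) per-channel kernels, K odd.
    """
    C, T = x.shape
    K = w.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # zero-pad the time axis
    out = np.zeros((C, T))
    for k in range(K):
        out += w[:, k:k + 1] * xp[:, k:k + T]
    return out


def merge_branches(kernels):
    """Fold parallel depthwise branches into one kernel.

    Each (C, K_i) kernel is zero-padded, centered, to the largest K
    and the padded kernels are summed. Assumes summed branch outputs.
    """
    k_max = max(w.shape[1] for w in kernels)
    merged = np.zeros((kernels[0].shape[0], k_max))
    for w in kernels:
        off = (k_max - w.shape[1]) // 2
        merged[:, off:off + w.shape[1]] += w
    return merged


def full_conv_params(channels, kernel):
    """Weights in a standard TDNN layer (C_in = C_out, no bias)."""
    return channels * channels * kernel


def separated_params(channels, branch_kernels):
    """Weights in a pointwise channel operator plus depthwise branches."""
    return channels * channels + channels * sum(branch_kernels)


# Hypothetical sizes: 8 channels, 100 frames, branches with K = 1, 3, 7.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 100))
branches = [rng.standard_normal((8, k)) for k in (1, 3, 7)]

multi_path = sum(depthwise_conv1d(x, w) for w in branches)
single_path = depthwise_conv1d(x, merge_branches(branches))
print("outputs match:", np.allclose(multi_path, single_path))

# Parameter cost of one full conv layer vs. the separated design.
print(full_conv_params(512, 3), separated_params(512, [3, 5, 7]))
```

Under these assumptions the merged kernel reproduces the multi-branch output exactly, because convolution is linear in the kernel. Each extra temporal branch costs only C x K weights, whereas widening a full convolution costs C x C weights per extra tap. Branches with dilation or batch normalization would need extra folding steps that this sketch does not cover.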
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Delayed Memory Unit: Modelling Temporal Dependency Through Delay Gate [16.4160685571157]
Recurrent Neural Networks (RNNs) are widely recognized for their proficiency in modeling temporal dependencies.
This paper proposes a novel Delayed Memory Unit (DMU) for gated RNNs.
The DMU incorporates a delay line structure along with delay gates into vanilla RNN, thereby enhancing temporal interaction and facilitating temporal credit assignment.
arXiv Detail & Related papers (2023-10-23T14:29:48Z) - Disentangling Structured Components: Towards Adaptive, Interpretable and
Scalable Time Series Forecasting [52.47493322446537]
We develop an adaptive, interpretable and scalable forecasting framework, which seeks to individually model each component of the spatial-temporal patterns.
SCNN works with a pre-defined generative process of MTS, which arithmetically characterizes the latent structure of the spatial-temporal patterns.
Extensive experiments are conducted to demonstrate that SCNN can achieve superior performance over state-of-the-art models on three real-world datasets.
arXiv Detail & Related papers (2023-05-22T13:39:44Z) - Sequence Modeling with Multiresolution Convolutional Memory [27.218134279968062]
We build a new building block for sequence modeling called a MultiresLayer.
The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence.
Our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks.
arXiv Detail & Related papers (2023-05-02T17:50:54Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate
Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It offers three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - MFA: TDNN with Multi-scale Frequency-channel Attention for
Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training, yielding 14% relative word error rate (WER) improvement on LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z) - CS-Rep: Making Speaker Verification Networks Embracing
Re-parameterization [27.38202134344989]
This study proposes cross-sequential re-parameterization (CS-Rep) to increase the inference speed and verification accuracy of models.
Rep-TDNN increases the actual inference speed by about 50% and reduces the EER by 10%.
arXiv Detail & Related papers (2021-10-26T08:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.