DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter
for Speaker Verification
- URL: http://arxiv.org/abs/2303.11020v3
- Date: Tue, 1 Aug 2023 07:09:50 GMT
- Title: DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter
for Speaker Verification
- Authors: Yangfu Li, Jiapan Gan, Xiaodan Lin
- Abstract summary: We introduce a novel module called Global-aware Filter layer (GF layer) in this work.
We present a dual-stream TDNN architecture called DS-TDNN for automatic speaker verification (ASV)
Experiments on the Voxceleb and SITW databases demonstrate that the DS-TDNN achieves a relative improvement of 10% together with a relative decline of 20% in computational cost.
- Score: 3.0831477850153224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional time-delay neural networks (TDNNs) struggle to handle long-range
context; as a result, their ability to represent speaker information is limited in
long utterances. Existing solutions either depend on increasing model
complexity or try to balance between local features and global context to
address this issue. To effectively leverage the long-term dependencies of audio
signals and constrain model complexity, we introduce a novel module called
Global-aware Filter layer (GF layer) in this work, which employs a set of
learnable transform-domain filters between a 1D discrete Fourier transform and
its inverse transform to capture global context. Additionally, we develop a
dynamic filtering strategy and a sparse regularization method to enhance the
performance of the GF layer and prevent overfitting. Based on the GF layer, we
present a dual-stream TDNN architecture called DS-TDNN for automatic speaker
verification (ASV), which utilizes two unique branches to extract both local
and global features in parallel and employs an efficient strategy to fuse
different-scale information. Experiments on the VoxCeleb and SITW databases
demonstrate that the DS-TDNN achieves a relative improvement of 10% together
with a relative decline of 20% in computational cost over the ECAPA-TDNN in
the speaker verification task. This improvement becomes more evident as the
utterance duration grows. Furthermore, the DS-TDNN also outperforms popular deep
residual models and attention-based systems on utterances of arbitrary length.
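To make the frequency-domain idea concrete, the following is a minimal sketch of a global filter layer of the kind the abstract describes: a 1D discrete Fourier transform, an element-wise learnable transform-domain filter, and the inverse transform. It is written in PyTorch under assumptions of our own (the class name GlobalFilter1D, the fixed number of frames per utterance, and the initialization scale are illustrative and not taken from the authors' code), and it omits the paper's dynamic filtering strategy and sparse regularization.

```python
import torch
import torch.nn as nn


class GlobalFilter1D(nn.Module):
    """Minimal sketch of a global-aware filter (GF) layer.

    A set of learnable transform-domain filters is applied between a 1D DFT
    and its inverse, so every output frame can depend on the whole utterance
    (global context). Dynamic filtering and sparse regularization from the
    paper are omitted in this sketch.
    """

    def __init__(self, channels: int, seq_len: int):
        super().__init__()
        # One complex filter per channel and frequency bin. This assumes a
        # fixed sequence length; a real implementation would interpolate or
        # crop the filter for variable-length inputs.
        freq_bins = seq_len // 2 + 1
        self.weight = nn.Parameter(torch.randn(channels, freq_bins, 2) * 0.02)
        self.seq_len = seq_len

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        X = torch.fft.rfft(x, dim=-1)                       # to the frequency domain
        W = torch.view_as_complex(self.weight)              # learnable global filter
        X = X * W                                           # element-wise filtering
        return torch.fft.irfft(X, n=self.seq_len, dim=-1)   # back to the time domain


if __name__ == "__main__":
    layer = GlobalFilter1D(channels=64, seq_len=200)
    frames = torch.randn(8, 64, 200)   # e.g. 8 utterances, 64 channels, 200 frames
    out = layer(frames)
    print(out.shape)                   # torch.Size([8, 64, 200])
```

Because each frequency bin mixes contributions from every frame, a single filtering pass gives every output frame access to the full utterance at roughly O(T log T) cost, which is the property that lets such a layer capture long-range context without attention.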
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers
for Streaming Speech Recognition [14.384132377946154]
We introduce a new streaming ASR model, ConvRNN-T, with a novel convolutional context consisting of local and global context encoders.
We show that ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on Librispeech and in-house data.
ConvRNN-T's superior accuracy along with its low footprint make it a promising candidate for on-device streaming ASR technologies.
arXiv Detail & Related papers (2022-09-29T15:33:41Z) - MACCIF-TDNN: Multi aspect aggregation of channel and context
interdependence features in TDNN-based speaker verification [5.28889161958623]
We propose a new network architecture that aggregates channel and context interdependence features from multiple aspects, based on the Time Delay Neural Network (TDNN).
The proposed MACCIF-TDNN architecture can outperform most of the state-of-the-art TDNN-based systems on VoxCeleb1 test sets.
arXiv Detail & Related papers (2021-07-07T09:43:42Z) - Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z) - Deep Learning-based Resource Allocation For Device-to-Device
Communication [66.74874646973593]
We propose a framework for the optimization of the resource allocation in multi-channel cellular systems with device-to-device (D2D) communication.
A deep learning (DL) framework is proposed, where the optimal resource allocation strategy for arbitrary channel conditions is approximated by deep neural network (DNN) models.
Our simulation results confirm that near-optimal performance can be attained with low computation time, which underlines the real-time capability of the proposed scheme.
arXiv Detail & Related papers (2020-11-25T14:19:23Z) - Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Depthwise Separable Convolutions Versus Recurrent Neural Networks for
Monaural Singing Voice Separation [17.358040670413505]
We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with depthwise separable (DWS) convolutions (DWS-CNNs).
We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance.
Our results show that replacing the RNNs with DWS-CNNs yields improvements of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the parameters of the RNN architecture.
arXiv Detail & Related papers (2020-07-06T12:32:34Z) - STDPG: A Spatio-Temporal Deterministic Policy Gradient Agent for Dynamic
Routing in SDN [6.27420060051673]
Dynamic routing in software-defined networking (SDN) can be viewed as a centralized decision-making problem.
We propose a novel model-free framework for dynamic routing in SDN, which is referred to as the spatio-temporal deterministic policy gradient (STDPG) agent.
STDPG achieves better routing solutions in terms of average end-to-end delay.
arXiv Detail & Related papers (2020-04-21T07:19:07Z) - Dense Residual Network: Enhancing Global Dense Feature Flow for
Character Recognition [75.4027660840568]
This paper explores how to enhance the local and global dense feature flow by fully exploiting hierarchical features from all the convolution layers.
Technically, we propose an efficient and effective CNN framework, i.e., Fast Dense Residual Network (FDRN) for text recognition.
arXiv Detail & Related papers (2020-01-23T06:55:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences.