MACCIF-TDNN: Multi aspect aggregation of channel and context
interdependence features in TDNN-based speaker verification
- URL: http://arxiv.org/abs/2107.03104v1
- Date: Wed, 7 Jul 2021 09:43:42 GMT
- Title: MACCIF-TDNN: Multi aspect aggregation of channel and context
interdependence features in TDNN-based speaker verification
- Authors: Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu
- Abstract summary: We propose a new network architecture that aggregates channel and context interdependence features from multiple aspects, based on the Time Delay Neural Network (TDNN).
The proposed MACCIF-TDNN architecture outperforms most state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.
- Score: 5.28889161958623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most recent state-of-the-art results for speaker verification are
achieved by the X-vector and its subsequent variants. In this paper, we propose
a new network architecture that aggregates channel and context interdependence
features from multiple aspects, based on the Time Delay Neural Network (TDNN).
Firstly, we use SE-Res2Blocks, as in ECAPA-TDNN, to explicitly model channel
interdependence, realizing adaptive calibration of channel features and
processing local context features in a multi-scale way at a more granular level
than conventional TDNN-based methods. Secondly, we explore using the
Transformer encoder structure to model global context interdependence features
at the utterance level, which can better capture long-term temporal
characteristics. Before the pooling layer, we aggregate the outputs of the
SE-Res2Blocks and the Transformer encoder to leverage the complementary channel
and context interdependence features learned by each branch. Finally, instead
of performing a single attentive statistics pooling, we find it beneficial to
extend the pooling method in a multi-head way, which can discriminate features
from multiple aspects. The proposed MACCIF-TDNN architecture outperforms most
state-of-the-art TDNN-based systems on the VoxCeleb1 test sets.
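
The minimal PyTorch sketch below illustrates the aggregation idea described in the abstract: frame-level features pass through an SE-Res2Block-style branch (channel interdependence, local context) and a Transformer encoder branch (utterance-level context), the two outputs are concatenated before pooling, and a multi-head attentive statistics pooling produces the speaker embedding. This is not the authors' released code: the SE-Res2Block stack is replaced by a simple dilated-convolution stand-in, and all layer sizes (80-dimensional filterbanks, 512 channels, 4 pooling heads, 192-dimensional embedding) are illustrative assumptions.

```python
# Hypothetical sketch of a MACCIF-TDNN-style forward path (assumed sizes).
import torch
import torch.nn as nn


class MultiHeadAttentiveStatsPooling(nn.Module):
    """Attentive statistics pooling split into several heads so each head can
    attend to a different aspect of the frame-level features."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.head_dim = channels // heads
        self.attention = nn.Conv1d(channels, heads, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        b, c, t = x.shape
        w = torch.softmax(self.attention(x), dim=-1)      # (b, heads, t) weights over time
        xh = x.view(b, self.heads, self.head_dim, t)      # split channels across heads
        wh = w.unsqueeze(2)                               # (b, heads, 1, t)
        mean = (xh * wh).sum(dim=-1)
        std = ((xh - mean.unsqueeze(-1)) ** 2 * wh).sum(dim=-1).clamp(min=1e-6).sqrt()
        return torch.cat([mean.flatten(1), std.flatten(1)], dim=1)  # (b, 2 * channels)


class MACCIFStyleBackbone(nn.Module):
    def __init__(self, feat_dim: int = 80, channels: int = 512, emb_dim: int = 192):
        super().__init__()
        self.frame_conv = nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2)
        # Stand-in for the SE-Res2Block stack (channel interdependence,
        # multi-scale local context); an ECAPA-TDNN implementation provides the real block.
        self.se_res2_stack = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
        )
        # Transformer encoder for utterance-level (global) context interdependence.
        encoder_layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.pool = MultiHeadAttentiveStatsPooling(2 * channels, heads=4)
        self.embedding = nn.Linear(4 * channels, emb_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time), e.g. log mel filterbanks
        x = torch.relu(self.frame_conv(feats))
        channel_feats = self.se_res2_stack(x)                                     # channel branch
        context_feats = self.context_encoder(x.transpose(1, 2)).transpose(1, 2)   # context branch
        fused = torch.cat([channel_feats, context_feats], dim=1)                  # aggregate both
        return self.embedding(self.pool(fused))                                   # speaker embedding


if __name__ == "__main__":
    model = MACCIFStyleBackbone()
    emb = model(torch.randn(2, 80, 300))   # 2 utterances, 300 frames
    print(emb.shape)                       # torch.Size([2, 192])
```

Running the example prints one 192-dimensional embedding per utterance; a real system would add a classifier head (e.g. AAM-softmax, as is common for ECAPA-style models) and train on a large corpus before evaluating on the VoxCeleb1 test sets.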
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter
for Speaker Verification [3.0831477850153224]
We introduce a novel module called Global-aware Filter layer (GF layer) in this work.
We present a dual-stream TDNN architecture called DS-TDNN for automatic speaker verification (ASV).
Experiments on the VoxCeleb and SITW databases demonstrate that the DS-TDNN achieves a relative improvement of 10% together with a 20% relative reduction in computational cost.
arXiv Detail & Related papers (2023-03-20T10:58:12Z) - Spiking Neural Network Decision Feedback Equalization [70.3497683558609]
We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE).
We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels.
The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.
arXiv Detail & Related papers (2022-11-09T09:19:15Z) - Two-Timescale End-to-End Learning for Channel Acquisition and Hybrid
Precoding [94.40747235081466]
We propose an end-to-end deep learning-based joint transceiver design algorithm for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) systems.
We develop a DNN architecture that maps the received pilots into feedback bits at the receiver, and then further maps the feedback bits into the hybrid precoder at the transmitter.
arXiv Detail & Related papers (2021-10-22T20:49:02Z) - Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z) - End-to-End Learning for Uplink MU-SIMO Joint Transmitter and
Non-Coherent Receiver Design in Fading Channels [11.182920270301304]
A novel end-to-end learning approach, namely JTRD-Net, is proposed for uplink multiuser single-input multiple-output (MU-SIMO) joint transmitter and non-coherent receiver design (JTRD) in fading channels.
The transmitter side is modeled as a group of parallel linear layers, which are responsible for multiuser waveform design.
The non-coherent receiver is formed by a deep feed-forward neural network (DFNN) so as to provide multiuser detection (MUD) capabilities.
arXiv Detail & Related papers (2021-05-04T02:47:59Z) - Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice
Separation [40.170868770930774]
Monaural Singing Voice Separation (MSVS) is a challenging task and has been studied for decades.
Deep neural networks (DNNs) are the current state-of-the-art methods for MSVS.
We introduce a Neural Architecture Search (NAS) method to the structure design of DNNs for MSVS.
arXiv Detail & Related papers (2020-08-03T12:09:42Z) - Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields so as to reconfigure intermediate CNN features spatially and channel-wisely.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
arXiv Detail & Related papers (2020-07-18T14:00:12Z) - Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyperparameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z) - Depthwise Separable Convolutions Versus Recurrent Neural Networks for
Monaural Singing Voice Separation [17.358040670413505]
We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with depthwise separable (DWS) convolutions (DWS-CNNs).
We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on the source separation performance.
Our results show that replacing RNNs with DWS-CNNs yields improvements of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the parameters of the RNN architecture.
arXiv Detail & Related papers (2020-07-06T12:32:34Z)