LightCAM: A Fast and Light Implementation of Context-Aware Masking based
D-TDNN for Speaker Verification
- URL: http://arxiv.org/abs/2402.06073v2
- Date: Mon, 12 Feb 2024 15:28:38 GMT
- Title: LightCAM: A Fast and Light Implementation of Context-Aware Masking based
D-TDNN for Speaker Verification
- Authors: Di Cao, Xianchen Wang, Junfeng Zhou, Jiakai Zhang, Yanjing Lei and
Wenpeng Chen
- Abstract summary: Traditional Time Delay Neural Networks (TDNN) have achieved state-of-the-art performance at the cost of high computational complexity and slower inference speed.
We propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) for feature fusion.
- Score: 3.3800597813242628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional Time Delay Neural Networks (TDNN) have achieved state-of-the-art
performance at the cost of high computational complexity and slower inference
speed, making them difficult to implement in an industrial environment. The
Densely Connected Time Delay Neural Network (D-TDNN) with Context Aware Masking
(CAM) module has proven to be an efficient structure to reduce complexity while
maintaining system performance. In this paper, we propose a fast and
lightweight model, LightCAM, which further adopts a depthwise separable
convolution module (DSM) and uses multi-scale feature aggregation (MFA) for
feature fusion at different levels. Extensive experiments are conducted on the
VoxCeleb dataset; the comparative results show that it achieves an EER of
0.83 and a MinDCF of 0.0891 on VoxCeleb1-O, outperforming other
mainstream speaker verification methods. In addition, complexity analysis
further demonstrates that the proposed architecture has lower computational
cost and faster inference speed.
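The abstract credits part of the complexity reduction to a depthwise separable convolution module. The exact DSM design is not given here, so the following is only a minimal sketch of the generic technique: it compares the weight count of a standard 1D convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. The function names and the layer sizes are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a standard 1D convolution (bias omitted)."""
    return c_in * c_out * kernel


def depthwise_separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a depthwise conv followed by a 1x1 pointwise conv (bias omitted)."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, for illustration only (not from the paper).
    c_in, c_out, kernel = 128, 128, 3
    std = standard_conv1d_params(c_in, c_out, kernel)
    dsc = depthwise_separable_conv1d_params(c_in, c_out, kernel)
    print(f"standard: {std}, depthwise separable: {dsc}, ratio: {dsc / std:.3f}")
```

Under these assumptions the ratio of the two counts is 1/c_out + 1/kernel, so the saving grows with kernel width and output channel count.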
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition [16.11114486075643]
PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency.
Experiments reveal that PaDeLLM-NER significantly increases inference speed, running 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese.
arXiv Detail & Related papers (2024-02-07T13:39:38Z) - Best of Both Worlds: Hybrid SNN-ANN Architecture for Event-based Optical Flow Estimation [12.611797572621398]
Spiking Neural Networks (SNNs) with their asynchronous event-driven compute show great potential for extracting features from event streams.
We propose a novel SNN-ANN hybrid architecture that combines the strengths of both.
arXiv Detail & Related papers (2023-06-05T15:26:02Z) - SpikeSim: An end-to-end Compute-in-Memory Hardware Evaluation Tool for
Benchmarking Spiking Neural Networks [4.0300632886917]
SpikeSim is a tool that can perform realistic performance, energy, latency and area evaluation of IMC-mapped SNNs.
We propose SNN topological modifications leading to 1.24x and 10x reduction in the neuronal module's area and the overall energy-delay-product value.
arXiv Detail & Related papers (2022-10-24T01:07:17Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Learning Frequency-aware Dynamic Network for Efficient Super-Resolution [56.98668484450857]
This paper explores a novel frequency-aware dynamic network for dividing the input into multiple parts according to its coefficients in the discrete cosine transform (DCT) domain.
In practice, the high-frequency part will be processed using expensive operations and the lower-frequency part is assigned with cheap operations to relieve the computation burden.
Experiments conducted on benchmark SISR models and datasets show that the frequency-aware dynamic network can be employed for various SISR neural architectures.
arXiv Detail & Related papers (2021-03-15T12:54:26Z) - Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-07-17T08:32:11Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Fully-parallel Convolutional Neural Network Hardware [0.7829352305480285]
We propose a new power-and-area-efficient architecture for implementing Artificial Neural Networks (ANNs) in hardware.
For the first time, a fully-parallel CNN such as LeNet-5 is embedded and tested in a single FPGA.
arXiv Detail & Related papers (2020-06-22T17:19:09Z) - STONNE: A Detailed Architectural Simulator for Flexible Neural Network
Accelerators [5.326345912766044]
STONNE is a cycle-accurate, highly-modular and highly-extensible simulation framework.
We show how it can closely approach the performance results of the publicly available BSV-coded MAERI implementation.
arXiv Detail & Related papers (2020-06-10T19:20:52Z) - Toward fast and accurate human pose estimation via soft-gated skip
connections [97.06882200076096]
This paper is on highly accurate and highly efficient human pose estimation.
We re-analyze this design choice in the context of improving both the accuracy and the efficiency over the state-of-the-art.
Our model achieves state-of-the-art results on the MPII and LSP datasets.
arXiv Detail & Related papers (2020-02-25T18:51:51Z)