Multi-Frequency Information Enhanced Channel Attention Module for
Speaker Representation Learning
- URL: http://arxiv.org/abs/2207.04540v1
- Date: Sun, 10 Jul 2022 21:19:36 GMT
- Title: Multi-Frequency Information Enhanced Channel Attention Module for
Speaker Representation Learning
- Authors: Mufan Sang, John H.L. Hansen
- Abstract summary: We propose to utilize multi-frequency information and design two novel and effective attention modules.
The proposed attention modules can effectively capture more speaker information from multiple frequency components on the basis of DCT.
Experimental results demonstrate that our proposed SFSC and MFSC attention modules can efficiently generate more discriminative speaker representations.
- Score: 41.44950556040058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, attention mechanisms have been applied successfully in neural
network-based speaker verification systems. Incorporating the
Squeeze-and-Excitation block into convolutional neural networks has achieved
remarkable performance. However, it relies on global average pooling (GAP),
which simply averages the features along the time and frequency dimensions and
therefore cannot preserve sufficient speaker information in the feature maps. In
this study, we show mathematically that GAP is a special case of the discrete
cosine transform (DCT) on the time-frequency domain, using only the lowest
frequency component of the frequency decomposition. To strengthen the speaker
information extraction
ability, we propose to utilize multi-frequency information and design two novel
and effective attention modules, called Single-Frequency Single-Channel (SFSC)
attention module and Multi-Frequency Single-Channel (MFSC) attention module.
The proposed attention modules can effectively capture more speaker information
from multiple frequency components on the basis of DCT. We conduct
comprehensive experiments on the VoxCeleb datasets and a probe evaluation on
the 1st 48-UTD forensic corpus. Experimental results demonstrate that our
proposed SFSC and MFSC attention modules can efficiently generate more
discriminative speaker representations and outperform the ResNet34-SE and
ECAPA-TDNN systems with relative 20.9% and 20.2% reductions in EER,
respectively, without adding extra network parameters.
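The abstract's key observation, that GAP is the lowest-frequency component of a 2D DCT, can be checked numerically. The sketch below (not the authors' code; the multi-frequency selection at the end is a hypothetical illustration of the idea) builds the DCT-II basis directly: at frequency (0, 0) the basis is the all-ones matrix, so the coefficient reduces to the sum of the feature map, i.e. GAP up to a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))  # toy time-frequency feature map (T x F)
T, F = x.shape

def dct_basis(h, w):
    """2D DCT-II basis function at frequency pair (h, w)."""
    t = np.arange(T)[:, None]
    f = np.arange(F)[None, :]
    return (np.cos(np.pi * h * (2 * t + 1) / (2 * T))
            * np.cos(np.pi * w * (2 * f + 1) / (2 * F)))

# At (h, w) = (0, 0) the basis is all ones, so the DCT coefficient
# is just the sum of the feature map -- i.e. GAP times T*F:
coeff_00 = (x * dct_basis(0, 0)).sum()
gap = x.mean()
assert np.allclose(gap, coeff_00 / (T * F))

# Multi-frequency pooling (the idea behind SFSC/MFSC; the particular
# frequency pairs here are a hypothetical choice, not the paper's):
freqs = [(0, 0), (0, 1), (1, 0), (1, 1)]
multi_freq_feats = np.array([(x * dct_basis(h, w)).sum() for h, w in freqs])
```

Pooling with several basis functions instead of only (0, 0) retains higher-frequency structure that plain averaging discards, which is the motivation the abstract gives for the proposed modules.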
Related papers
- Neuromorphic Wireless Split Computing with Multi-Level Spikes [69.73249913506042]
In neuromorphic computing, spiking neural networks (SNNs) perform inference tasks, offering significant efficiency gains for workloads involving sequential data.
Recent advances in hardware and software have demonstrated that embedding a few bits of payload in each spike exchanged between the spiking neurons can further enhance inference accuracy.
This paper investigates a wireless neuromorphic split computing architecture employing multi-level SNNs.
arXiv Detail & Related papers (2024-11-07T14:08:35Z) - Exploring Cross-Domain Few-Shot Classification via Frequency-Aware Prompting [37.721042095518044]
Cross-Domain Few-Shot Learning has made great strides with the development of meta-learning.
We propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification.
arXiv Detail & Related papers (2024-06-24T08:14:09Z) - Complementary Frequency-Varying Awareness Network for Open-Set
Fine-Grained Image Recognition [14.450381668547259]
Open-set image recognition is a challenging topic in computer vision.
We propose a Complementary Frequency-varying Awareness Network that can better capture both high-frequency and low-frequency information.
Based on CFAN, we propose an open-set fine-grained image recognition method, called CFAN-OSFGR.
arXiv Detail & Related papers (2023-07-14T08:15:36Z) - Joint Channel Estimation and Feedback with Masked Token Transformers in
Massive MIMO Systems [74.52117784544758]
This paper proposes an encoder-decoder based network that unveils the intrinsic frequency-domain correlation within the CSI matrix.
The entire encoder-decoder network is utilized for channel compression.
Our method outperforms state-of-the-art channel estimation and feedback techniques in joint tasks.
arXiv Detail & Related papers (2023-06-08T06:15:17Z) - Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose to Adaptively learn Frequency information in a two-branch Detection framework, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z) - MFA: TDNN with Multi-scale Frequency-channel Attention for
Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z) - Raw Waveform Encoder with Multi-Scale Globally Attentive Locally
Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z) - Speaker Representation Learning using Global Context Guided Channel and
Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z) - Robust Multi-channel Speech Recognition using Frequency Aligned Network [23.397670239950187]
We use a frequency aligned network for robust automatic speech recognition.
We show that our multi-channel acoustic model with a frequency aligned network shows up to 18% relative reduction in word error rate.
arXiv Detail & Related papers (2020-02-06T21:47:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.