Speaker Representation Learning using Global Context Guided Channel and
Time-Frequency Transformations
- URL: http://arxiv.org/abs/2009.00768v2
- Date: Wed, 9 Sep 2020 16:56:31 GMT
- Authors: Wei Xia, John H.L. Hansen
- Abstract summary: We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
- Score: 67.18006078950337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we propose the global context guided channel and
time-frequency transformations to model the long-range, non-local
time-frequency dependencies and channel variances in speaker representations.
We use the global context information to enhance important channels and
recalibrate salient time-frequency locations by computing the similarity
between the global context and local features. The proposed modules, together
with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset,
which is a large-scale speaker verification corpus collected in the wild. This
lightweight block can be easily incorporated into a CNN model at little
additional computational cost, and it improves speaker verification
performance by a large margin over both the baseline ResNet-LDE model and the
Squeeze-and-Excitation block. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed modules. We find that by employing the proposed L2-tf-GTFC
transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a
relative reduction of 32.68%, and the DCF score improves by a relative
27.28%. The results indicate that our proposed global context guided
transformation modules can efficiently improve the learned speaker
representations by achieving time-frequency and channel-wise feature
recalibration.
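
The recalibration idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the global context is pooled or how similarity is computed, so this sketch assumes mean pooling and cosine similarity passed through a sigmoid, and it omits any learned projections. The function names (`channel_gates`, `tf_gates`, `recalibrate`) and the channel x frequency x time layout are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cosine(u, v, eps=1e-8):
    # Cosine similarity between two equal-length vectors (eps avoids 0/0).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def channel_gates(x):
    # Assumption: the context for channels is the channel-averaged map.
    # Each channel gets one gate from its similarity to that context.
    C, F, T = len(x), len(x[0]), len(x[0][0])
    flat = [[x[c][f][t] for f in range(F) for t in range(T)] for c in range(C)]
    context = [sum(flat[c][i] for c in range(C)) / C for i in range(F * T)]
    return [sigmoid(cosine(flat[c], context)) for c in range(C)]


def tf_gates(x):
    # Assumption: the context for locations is the per-channel mean over all
    # time-frequency positions. Each (f, t) position gets one gate from the
    # similarity between its channel vector and that context.
    C, F, T = len(x), len(x[0]), len(x[0][0])
    context = [
        sum(x[c][f][t] for f in range(F) for t in range(T)) / (F * T)
        for c in range(C)
    ]
    return [
        [sigmoid(cosine([x[c][f][t] for c in range(C)], context)) for t in range(T)]
        for f in range(F)
    ]


def recalibrate(x):
    # Scale every element by its channel gate and its time-frequency gate.
    C, F, T = len(x), len(x[0]), len(x[0][0])
    cg = channel_gates(x)
    tg = tf_gates(x)
    return [
        [[x[c][f][t] * cg[c] * tg[f][t] for t in range(T)] for f in range(F)]
        for c in range(C)
    ]


# Toy input: 2 channels, 3 frequency bins, 4 frames.
random.seed(0)
x = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(3)] for _ in range(2)]
y = recalibrate(x)
```

In this sketch both gate sets are computed from the same input and applied multiplicatively, so the output keeps the input's shape and every element is only scaled. A trained module would typically place learned transformations before the gating, which the sketch leaves out.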
Related papers
- R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [83.77114091471822]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML).
A challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming.
This is particularly pronounced for word embedding parameters in large language models (LLMs), which are crucial for language understanding.
A physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z)
- Score-CDM: Score-Weighted Convolutional Diffusion Model for Multivariate Time Series Imputation [0.035984704795350306]
Multivariate time series (MTS) data are usually incomplete in real scenarios.
We propose a Score-weighted Convolutional Diffusion Model (Score-CDM), whose backbone consists of a Score-weighted Convolution Module (SCM) and an Adaptive Reception Module (ARM).
We conduct extensive evaluations on three real MTS datasets of different domains, and the result verifies the effectiveness of the proposed Score-CDM.
arXiv Detail & Related papers (2024-05-21T02:00:55Z)
- ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows speech to be recognized better in the presence of environmental noise and significantly accelerates training, reaching lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations [5.4878772986187565]
We propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE.
Our proposed model not only achieves substantial improvements over its backbone, but also outperforms the current state of the art while being 27% smaller.
With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
arXiv Detail & Related papers (2022-09-24T02:33:40Z)
- Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning [41.44950556040058]
We propose to utilize multi-frequency information and design two novel and effective attention modules.
The proposed attention modules can effectively capture more speaker information from multiple frequency components on the basis of DCT.
Experimental results demonstrate that our proposed SFSC and MFSC attention modules can efficiently generate more discriminative speaker representations.
arXiv Detail & Related papers (2022-07-10T21:19:36Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
- Delay Minimization for Federated Learning Over Wireless Communication Networks [172.42768672943365]
The problem of delay minimization for federated learning (FL) over wireless communication networks is investigated.
A bisection search algorithm is proposed to obtain the optimal solution.
Simulation results show that the proposed algorithm can reduce delay by up to 27.3% compared to conventional FL methods.
arXiv Detail & Related papers (2020-07-05T19:00:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.