MCSAE: Masked Cross Self-Attentive Encoding for Speaker Embedding
- URL: http://arxiv.org/abs/2001.10817v4
- Date: Tue, 28 Jul 2020 07:19:37 GMT
- Title: MCSAE: Masked Cross Self-Attentive Encoding for Speaker Embedding
- Authors: Soonshin Seo, Ji-Hwan Kim
- Abstract summary: We propose masked cross self-attentive encoding (MCSAE) using ResNet.
It focuses on the features of both high-level and low-level layers.
The experimental results showed an equal error rate of 2.63% and a minimum detection cost function of 0.1453.
- Score: 8.942112181408158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In general, a self-attention mechanism has been applied for speaker embedding
encoding. Previous studies focused on training the self-attention in a
high-level layer, such as the last pooling layer. However, the effect of
low-level features was reduced in the speaker embedding encoding. Therefore, we
propose masked cross self-attentive encoding (MCSAE) using ResNet, which focuses
on the features of both high-level and low-level layers. Based on multi-layer
aggregation, the output features of each residual layer are used as inputs to
the MCSAE. In the MCSAE, a cross self-attention module is trained on the
interdependence of the input features, and a random masking regularization
module is applied to prevent overfitting. As such, the MCSAE enhances the
weights of frames representing speaker information. The output features are then
concatenated and encoded into the speaker embedding, yielding a more
informative speaker embedding. The experimental
results showed an equal error rate of 2.63% and a minimum detection cost
function of 0.1453 using the VoxCeleb1 evaluation dataset. These were improved
performances compared with the previous self-attentive encoding and
state-of-the-art encoding methods.
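The encoding pipeline described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed details (feature dimensions, a uniform masking probability, and mean pooling as the frame weighting), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mcsae_sketch(layer_feats, rng, mask_prob=0.1):
    """Rough sketch of masked cross self-attentive encoding.

    layer_feats: list of (T, D) frame-level feature maps, one per
    residual layer (multi-layer aggregation). Each layer's frames
    attend over the concatenation of all layers' frames (cross
    self-attention), a random frame mask regularizes the scores,
    and the attended outputs are concatenated into one embedding.
    """
    # Stack features from all layers: the shared "cross" context.
    context = np.concatenate(layer_feats, axis=0)             # (L*T, D)
    pooled = []
    for feats in layer_feats:
        # Attention of this layer's frames against all layers' frames.
        scores = feats @ context.T / np.sqrt(feats.shape[1])  # (T, L*T)
        # Random masking regularization: drop some context frames.
        mask = rng.random(context.shape[0]) < mask_prob
        scores[:, mask] = -np.inf
        attn = softmax(scores, axis=1)                        # (T, L*T)
        attended = attn @ context                             # (T, D)
        # Pool over frames -> one vector per residual layer.
        pooled.append(attended.mean(axis=0))                  # (D,)
    # Concatenate per-layer vectors into the speaker embedding.
    return np.concatenate(pooled)                             # (L*D,)

rng = np.random.default_rng(0)
feats = [rng.standard_normal((50, 64)) for _ in range(3)]
emb = mcsae_sketch(feats, rng)
print(emb.shape)  # (192,)
```

The key point the sketch captures is that low-level residual layers contribute to the attention context on equal footing with the last layer, rather than only the final pooling layer being attended over.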
Related papers
- Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models [60.857389526958485]
MATA is a training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism.
Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains.
arXiv Detail & Related papers (2025-09-23T09:02:15Z) - Triplet Loss Based Quantum Encoding for Class Separability [2.7641963278515114]
The encoding circuit is trained using a triplet loss function inspired by classical facial recognition algorithms.
Benchmark tests performed on various binary classification tasks on the MNIST and MedMNIST datasets demonstrate considerable improvement over amplitude encoding with the same VQC structure.
arXiv Detail & Related papers (2025-09-19T07:28:19Z) - Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z) - Rethinking Patch Dependence for Masked Autoencoders [89.02576415930963]
We study the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning.
We propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE).
arXiv Detail & Related papers (2024-01-25T18:49:57Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-MixSpeaker network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Automated Audio Captioning via Fusion of Low- and High-Dimensional Features [48.62190893209622]
Existing AAC methods only use the high-dimensional representation of the PANNs as the input of the decoder.
A new encoder-decoder framework is proposed called the Low- and High-Dimensional Feature Fusion (LHDFF) model for AAC.
LHDFF achieves the best performance on the Clotho and AudioCaps datasets compared with other existing models.
arXiv Detail & Related papers (2022-10-10T22:39:41Z) - Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
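The mask-and-restore bookkeeping this entry describes can be sketched as follows. This is a hedged NumPy illustration of the token handling only (the encoder and decoder are stand-ins, and the 80% masking ratio is an assumed example value):

```python
import numpy as np

def mae_mask_and_restore(patches, mask_ratio, rng):
    """Sketch of the MAE token bookkeeping used by Audio-MAE:
    keep a random subset of spectrogram patches for the encoder,
    then re-insert mask tokens at the dropped positions so the
    decoder sees the full sequence in its original order."""
    n, d = patches.shape
    n_keep = round(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = perm[:n_keep]                 # visible patches
    # "Encoder" sees only the non-masked tokens (identity stand-in).
    encoded = patches[keep_idx]
    # Decoder input: mask tokens everywhere, visible tokens restored.
    mask_token = np.zeros(d)                 # learned in a real model
    decoder_in = np.tile(mask_token, (n, 1))
    decoder_in[keep_idx] = encoded
    return encoded, decoder_in, keep_idx

rng = np.random.default_rng(1)
patches = rng.standard_normal((100, 16))     # 100 patches, dim 16
enc, dec_in, keep = mae_mask_and_restore(patches, 0.8, rng)
print(enc.shape, dec_in.shape)               # (20, 16) (100, 16)
```

With an 80% masking ratio, the encoder processes only a fifth of the tokens, which is where the pretraining speedup of MAE-style methods comes from.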
arXiv Detail & Related papers (2022-07-13T17:59:55Z) - MAE-AST: Masked Autoencoding Audio Spectrogram Transformer [11.814012909512307]
We propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.
We leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens.
We find that MAE-like pretraining can provide a 3x speedup and 2x memory usage reduction over the vanilla SSAST.
arXiv Detail & Related papers (2022-03-30T22:06:13Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT
Domain [16.70806998451696]
We propose a mask-based post-filter operating directly in MDCT domain, inducing no extra delay.
The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network.
Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at the lowest possible bitrate of 16 kbps.
arXiv Detail & Related papers (2022-01-28T11:08:02Z) - Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z) - Rethinking and Improving Natural Language Generation with Layer-Wise
Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where each decoder layer receives the representations from the last encoder layer as a global view, supplemented by those from the other encoder layers for a stereoscopic view of the source sequences.
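The global-plus-supplementary-view idea can be sketched as follows. This is an assumed NumPy illustration with dot-product attention and a uniform merge standing in for whatever learned fusion the paper uses:

```python
import numpy as np

def multi_view_context(decoder_state, encoder_layers):
    """Sketch of layer-wise multi-view decoding: the decoder state
    attends to the last encoder layer (global view) and to each of
    the other encoder layers (supplementary views), then merges the
    per-view contexts. A real model would learn the merge weights."""
    def attend(q, keys):
        scores = keys @ q / np.sqrt(q.shape[0])   # (T,)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ keys                           # (D,)

    global_view = attend(decoder_state, encoder_layers[-1])
    other_views = [attend(decoder_state, layer)
                   for layer in encoder_layers[:-1]]
    # Uniform merge of views as a stand-in for a learned fusion.
    return np.mean([global_view] + other_views, axis=0)

rng = np.random.default_rng(2)
layers = [rng.standard_normal((10, 32)) for _ in range(4)]
ctx = multi_view_context(rng.standard_normal(32), layers)
print(ctx.shape)  # (32,)
```

This mirrors the MCSAE theme of the main paper: representations from intermediate layers supplement the top layer instead of being discarded.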
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.