Improving Transformer-based Networks With Locality For Automatic Speaker
Verification
- URL: http://arxiv.org/abs/2302.08639v1
- Date: Fri, 17 Feb 2023 01:04:51 GMT
- Title: Improving Transformer-based Networks With Locality For Automatic Speaker
Verification
- Authors: Mufan Sang, Yong Zhao, Gang Liu, John H.L. Hansen, Jian Wu
- Abstract summary: Transformer-based architectures have been explored for speaker embedding extraction.
In this study, we enhance the Transformer with locality modeling in two directions.
We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset.
- Score: 40.06788577864032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer-based architectures have been explored for speaker
embedding extraction. Although the Transformer employs the self-attention
mechanism to efficiently model the global interaction between token embeddings,
it is inadequate for capturing short-range local context, which is essential
for the accurate extraction of speaker information. In this study, we enhance
the Transformer with locality modeling in two directions. First, we propose
the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise
convolution and channel-wise attention into the Conformer blocks. Second, we
present the Speaker Swin Transformer (SST) by adapting the Swin Transformer,
originally proposed for vision tasks, into a speaker embedding network. We
evaluate the proposed approaches on the VoxCeleb datasets and a large-scale
Microsoft internal multilingual (MS-internal) dataset. The proposed models
achieve 0.75% EER on the VoxCeleb1 test set, outperforming previously proposed
Transformer-based models and CNN-based models such as ResNet34 and ECAPA-TDNN.
When trained on the MS-internal dataset, the proposed models achieve promising
results with a 14.6% relative reduction in EER over the Res2Net50 model.
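As a rough illustration of the locality modeling described above, the sketch below shows a Conformer-style convolution module augmented with depth-wise convolution and channel-wise (squeeze-and-excitation style) attention. This is a minimal PyTorch sketch under assumed design choices, not the authors' implementation; the names ChannelAttention, LocalityEnhancedConvModule, reduction, and kernel_size are illustrative and not taken from the paper.

```python
# Minimal sketch (illustrative, not the paper's code) of a Conformer-style
# convolution module with depth-wise convolution and channel-wise attention.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel-wise attention (assumed variant)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        scale = self.fc(x.mean(dim=-1))    # squeeze over the time axis
        return x * scale.unsqueeze(-1)     # re-weight channels


class LocalityEnhancedConvModule(nn.Module):
    """Depth-wise convolution block with channel attention (illustrative)."""

    def __init__(self, dim: int, kernel_size: int = 15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.channel_attn = ChannelAttention(dim)
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); residual connection around the module
        y = self.norm(x).transpose(1, 2)           # -> (batch, dim, time)
        y = self.glu(self.pointwise_in(y))
        y = self.act(self.bn(self.depthwise(y)))   # short-range local context
        y = self.channel_attn(y)                   # channel-wise re-weighting
        y = self.pointwise_out(y).transpose(1, 2)
        return x + y


if __name__ == "__main__":
    frames = torch.randn(4, 200, 256)              # (batch, time, feature dim)
    block = LocalityEnhancedConvModule(dim=256)
    print(block(frames).shape)                     # torch.Size([4, 200, 256])
```

In this sketch the depth-wise convolution supplies the short-range local context and the channel-wise attention re-weights feature channels, mirroring the two locality mechanisms the abstract attributes to the LE-Conformer block.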
Related papers
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop the Fully Adaptive Transformer (FAT) family of lightweight vision backbones.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient variant, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling [9.779600950401315]
We propose to cascade recurrent neural networks with the Transformer, referred to as the TransfoRNN model, to capture sequential information.
We found that TransfoRNN models consisting of only a shallow Transformer stack suffice to give comparable, if not better, performance.
arXiv Detail & Related papers (2021-04-04T09:31:18Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Multitask Learning and Joint Optimization for Transformer-RNN-Transducer Speech Recognition [13.198689566654107]
This paper explores multitask learning, joint optimization, and joint decoding methods for transformer-RNN-transducer systems.
We show that the proposed methods can reduce word error rate (WER) by 16.6% and 13.3% on the test-clean and test-other datasets, respectively.
arXiv Detail & Related papers (2020-11-02T06:38:06Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)