Multi-View Self-Attention Based Transformer for Speaker Recognition
- URL: http://arxiv.org/abs/2110.05036v1
- Date: Mon, 11 Oct 2021 07:03:23 GMT
- Title: Multi-View Self-Attention Based Transformer for Speaker Recognition
- Authors: Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing
Li, Yu Zhang
- Abstract summary: The Transformer model is widely used for speech processing tasks such as speaker recognition.
We propose a novel multi-view self-attention mechanism for the speaker Transformer.
We show that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
- Score: 33.21173007319178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Initially developed for natural language processing (NLP), the
Transformer model is now widely used for speech processing tasks such as
speaker recognition, owing to its powerful sequence modeling capabilities.
However, conventional self-attention mechanisms were originally designed for
modeling textual sequences, without considering the characteristics of speech
and speaker modeling. Moreover, different Transformer variants for speaker
recognition have not been well studied. In this work, we propose a novel
multi-view self-attention mechanism and present an empirical study of
different Transformer variants, with and without the proposed attention
mechanism, for speaker recognition.
Specifically, to balance capturing global dependencies with modeling
locality, we propose a multi-view self-attention mechanism for the speaker
Transformer, in which different attention heads can attend to different
ranges of the receptive field. Furthermore, we introduce and compare five
Transformer variants with different network architectures, embedding locations,
and pooling methods to learn speaker embeddings. Experimental results on the
VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view
self-attention mechanism improves speaker recognition performance, and the
proposed speaker Transformer network attains excellent results compared with
state-of-the-art models.
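The core idea in the abstract, different attention heads attending to different ranges of the receptive field, can be sketched as multi-head attention with per-head banded masks. The PyTorch snippet below is a minimal illustration under that assumption; the class name, the window_sizes parameter, and the hard band masks are our own illustrative choices, not the paper's exact formulation.

```python
import torch
from torch import nn


class MultiViewSelfAttention(nn.Module):
    """Sketch (not the paper's exact method): multi-head self-attention in
    which each head is restricted to a different receptive-field range via a
    banded mask, so some heads model locality while a global head keeps
    long-range context."""

    def __init__(self, d_model: int, window_sizes):
        super().__init__()
        self.num_heads = len(window_sizes)
        assert d_model % self.num_heads == 0
        self.d_head = d_model // self.num_heads
        self.window_sizes = window_sizes  # e.g. [None, 64, 16, 4]; None = global head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape  # x: (batch, frames, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, frames, d_head)
        q, k, v = (z.reshape(b, t, self.num_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, t, t)

        # Mask out key frames farther than half the head's window from the
        # query frame; the diagonal always survives, so softmax stays defined.
        pos = torch.arange(t, device=x.device)
        dist = (pos[:, None] - pos[None, :]).abs()
        for h, w in enumerate(self.window_sizes):
            if w is not None:
                scores[:, h] = scores[:, h].masked_fill(dist > w // 2,
                                                        float("-inf"))

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)
```

For example, MultiViewSelfAttention(256, [None, 64, 16, 4]) applied to a (batch, frames, 256) input gives one fully global head plus three heads with progressively narrower local windows over the frame sequence.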
Related papers
- Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement [17.645026729525462]
We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal.
Our experiments show that a dual-path transformer in the separator backbone, together with the proposed training paradigm, improves over the CNN baseline by 3.12 dB.
arXiv Detail & Related papers (2024-09-02T16:11:12Z)
- Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism [20.782319059183173]
We propose to explicitly model inter-sentential information in a Transformer-based end-to-end architecture for conversational speech recognition.
We show the effectiveness of our method on several open-source dialogue corpora; it consistently improves performance over utterance-level Transformer-based ASR models.
arXiv Detail & Related papers (2022-07-02T17:17:47Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the transformer distillation method, which is specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent speech corpus and the ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer that divides the hidden representation and parameters into multiple mechanisms, which exchange information only through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation [12.065178204539693]
Emotion Recognition in Conversation (ERC) is a personalized and interactive emotion recognition task.
Current methods model speakers' interactions by building a relation between every pair of speakers.
We simplify the complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies.
arXiv Detail & Related papers (2020-12-29T14:47:35Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Investigation of Speaker-adaptation methods in Transformer based ASR [8.637110868126548]
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance.
We present speaker information in the form of a speaker embedding for each speaker.
We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)