An Effective Transformer-based Contextual Model and Temporal Gate
Pooling for Speaker Identification
- URL: http://arxiv.org/abs/2308.11241v2
- Date: Sun, 10 Sep 2023 17:43:52 GMT
- Authors: Harunori Kawano and Sota Shimizu
- Abstract summary: This paper introduces an effective end-to-end speaker identification model built on a Transformer-based contextual model.
We propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification.
The proposed method achieved 87.1% accuracy with 28.5M parameters, comparable to wav2vec2 with 317.7M parameters.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wav2vec2 has achieved success in applying the Transformer architecture and
self-supervised learning to speech recognition. Recently, these techniques have come to be
used not only for speech recognition but for speech processing as a whole.
This paper introduces an effective end-to-end speaker identification model
built on a Transformer-based contextual model. We explored the relationship
between hyper-parameters and performance in order to discern the
structure of an effective model. Furthermore, we propose a pooling method,
Temporal Gate Pooling, with powerful learning ability for speaker
identification. We adopted Conformer as the encoder and BEST-RQ for pre-training,
and conducted an evaluation on the speaker identification task of VoxCeleb1.
The proposed method achieved 87.1% accuracy with 28.5M parameters,
comparable to wav2vec2 with 317.7M parameters. Code is
available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
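The paper's precise Temporal Gate Pooling formulation is in the repository above; purely as an illustration of the general idea, here is a minimal PyTorch sketch of a gated temporal pooling layer in which a learned per-frame gate weights each encoder frame before averaging over time. The module name, gate design, and dimensions are assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedTemporalPooling(nn.Module):
    """Sketch: a learned scalar gate weights each frame, and the gated
    frames are averaged into a fixed-length utterance embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # one gate value per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) encoder outputs, e.g. from a Conformer
        g = torch.sigmoid(self.gate(x))                           # (batch, time, 1)
        return (g * x).sum(dim=1) / g.sum(dim=1).clamp(min=1e-6)  # (batch, dim)

# Pool 200 Conformer frames per utterance into one 256-dim embedding.
embedding = GatedTemporalPooling(256)(torch.randn(8, 200, 256))  # (8, 256)
```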
Related papers
- One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary-length inputs and handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
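One way to picture the pseudo-language idea: cluster frame-level features into discrete IDs and collapse consecutive repeats into a compact token sequence that a decoder can be trained to emit. A toy sketch under those assumptions (the clustering choice and the run-length collapse are illustrative, not the paper's exact recipe):

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def induce_pseudo_tokens(features: np.ndarray, n_units: int = 32) -> list:
    """Map frame features (time, dim) to a compact discrete pseudo-token
    sequence: cluster frames, then merge consecutive duplicate IDs."""
    ids = KMeans(n_clusters=n_units, n_init=10).fit_predict(features)
    return [int(k) for k, _ in groupby(ids)]  # run-length collapse

frames = np.random.randn(200, 39)      # stand-in for 200 acoustic frames
pseudo = induce_pseudo_tokens(frames)  # decoder target for the
                                       # self-supervised pseudo-ASR task
```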
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
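The heart of t-SOT is serializing the tokens of overlapping speakers into a single chronological stream, with a special token marking each change of virtual output channel. A simplified sketch of that serialization (the event format and <cc> token name are assumptions for illustration):

```python
def serialize_t_sot(word_events, cc_token="<cc>"):
    """Merge (start_time, speaker, word) events into one token stream,
    inserting a channel-change token whenever the speaker switches."""
    stream, prev = [], None
    for _, speaker, word in sorted(word_events):
        if prev is not None and speaker != prev:
            stream.append(cc_token)
        stream.append(word)
        prev = speaker
    return stream

events = [(0.0, "A", "hello"), (0.4, "B", "hi"), (0.7, "A", "there")]
print(serialize_t_sot(events))  # ['hello', '<cc>', 'hi', '<cc>', 'there']
```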
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Self-supervised Learning with Random-projection Quantizer for Speech
Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to previous work using self-supervised learning with non-streaming models.
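The random-projection quantizer is strikingly simple: a frozen random matrix projects each frame, and the index of the nearest entry in a frozen random codebook becomes the discrete label predicted at masked positions. A minimal sketch (dimensions and codebook size here are illustrative):

```python
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook; neither is
    ever trained. Frames are labeled by their nearest codebook entry."""

    def __init__(self, in_dim: int, code_dim: int = 16, n_codes: int = 8192):
        g = torch.Generator().manual_seed(0)
        self.proj = torch.randn(in_dim, code_dim, generator=g)
        self.codebook = F.normalize(torch.randn(n_codes, code_dim, generator=g), dim=-1)

    def __call__(self, frames: torch.Tensor) -> torch.Tensor:
        z = F.normalize(frames @ self.proj, dim=-1)           # (time, code_dim)
        return torch.cdist(z, self.codebook).argmin(dim=-1)   # labels (time,)

labels = RandomProjectionQuantizer(80)(torch.randn(200, 80))
# The encoder is trained to predict `labels` at masked frames.
```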
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT
Based on the Quran Reciters Dataset [0.0]
We develop a deep learning model for Arabic speaker identification using the Wav2Vec2.0 and HuBERT audio representation learning tools.
The experiments show that an arbitrary wave signal from a given speaker can be identified with accuracies of 98% and 97.1%, respectively.
arXiv Detail & Related papers (2021-11-11T17:44:50Z)
- Fine-tuning wav2vec2 for speaker recognition [3.69563307866315]
We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding.
To adapt the framework to speaker recognition, we propose a single-utterance classification variant with CE or AAM softmax loss, and an utterance-pair classification variant with BCE loss.
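To make the recipe concrete, here is a minimal sketch of the single-utterance classification variant with mean pooling and an additive-angular-margin (AAM) softmax loss. Mean pooling is just one of the pooling options the paper compares, and the dimensions, margin, and scale below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxHead(nn.Module):
    """Sketch: mean-pool a wav2vec2 output sequence into a speaker
    embedding, then apply additive-angular-margin softmax."""

    def __init__(self, dim: int, n_speakers: int, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, dim))
        self.margin, self.scale = margin, scale

    def forward(self, seq: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        emb = seq.mean(dim=1)                                  # (batch, dim)
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)

# 4 utterances of 100 wav2vec2 frames, 1251 speakers (VoxCeleb1-sized).
loss = AAMSoftmaxHead(768, 1251)(torch.randn(4, 100, 768), torch.tensor([0, 5, 9, 3]))
```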
arXiv Detail & Related papers (2021-09-30T12:16:47Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for
Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method, which is specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
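Transformer distillation combines several matching terms; the soft-label piece alone can be sketched as a temperature-scaled KL divergence between teacher and student logits, mixed with the usual hard-label cross-entropy (the temperature, mixing weight, and 31-class setup below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label knowledge distillation: KL between temperature-softened
    teacher and student distributions, mixed with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 31), torch.randn(8, 31),
                         torch.randint(0, 31, (8,)))
```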
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- A Lightweight Speaker Recognition System Using Timbre Properties [0.5708902722746041]
We propose a lightweight text-independent speaker recognition model based on a random forest classifier.
It also introduces new features that are used for both speaker verification and identification tasks.
The prototype uses the seven most actively searched properties: boominess, brightness, depth, hardness, timbre, sharpness, and warmth.
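Since each utterance reduces to seven scalar properties, the classifier itself is compact; a minimal scikit-learn sketch with random stand-in data (the feature values, speaker count, and utterance count are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Rows are utterances described by the seven properties listed above
# (boominess, brightness, depth, hardness, timbre, sharpness, warmth).
X = np.random.rand(300, 7)            # placeholder feature values
y = np.random.randint(0, 10, 300)     # placeholder: 10 enrolled speakers

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(clf.predict(X[:3]))             # identify speakers of 3 utterances
```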
arXiv Detail & Related papers (2020-10-12T07:56:03Z)
- Investigation of Speaker-adaptation methods in Transformer based ASR [8.637110868126548]
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance.
We provide speaker information in the form of a speaker embedding for each speaker.
We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
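One integration point such a study can explore is concatenating a per-utterance speaker embedding to every acoustic frame before the encoder; a minimal sketch of that variant (the module name and dimensions are assumptions, and the paper compares several ways of injecting the embedding):

```python
import torch
import torch.nn as nn

class SpeakerConditionedFrontend(nn.Module):
    """Sketch: concatenate a speaker embedding to each acoustic frame,
    then project to the encoder's model dimension."""

    def __init__(self, feat_dim: int, spk_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim + spk_dim, model_dim)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); spk_emb: (batch, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.proj(torch.cat([feats, spk], dim=-1))

# 80-dim filterbanks plus a 192-dim speaker embedding -> 256-dim input.
x = SpeakerConditionedFrontend(80, 192, 256)(torch.randn(2, 120, 80),
                                             torch.randn(2, 192))
```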
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
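The pooling half of SAEP can be pictured as attention over time: a small scoring network assigns each frame a weight, and the softmax-normalized weights form a weighted sum. A minimal sketch of that attention pooling (layer sizes are illustrative; the full SAEP also includes the self-attention encoding stage):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch: score each frame with a small network, softmax the scores
    over time, and pool frames into one fixed-length embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level features
        w = torch.softmax(self.score(x), dim=1)   # (batch, time, 1)
        return (w * x).sum(dim=1)                 # (batch, dim)

emb = AttentionPooling(256)(torch.randn(4, 150, 256))  # handles any length
```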
arXiv Detail & Related papers (2020-08-03T09:31:27Z)