Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT
Based on the Quran Reciters Dataset
- URL: http://arxiv.org/abs/2111.06331v1
- Date: Thu, 11 Nov 2021 17:44:50 GMT
- Title: Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT
Based on the Quran Reciters Dataset
- Authors: Aly Moustafa and Salah A. Aly
- Abstract summary: We develop a deep learning model for Arabic speaker identification using the Wav2Vec2.0 and HuBERT audio representation learning tools.
The experiments show that an arbitrary wave signal from a given speaker can be identified with 98% and 97.1% accuracy, respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current authentication and trusted systems depend on classical and
biometric methods to recognize or authorize users. Such methods include audio
speech recognition and eye and fingerprint signatures. Recent tools utilize
deep learning and transformers to achieve better results. In this paper, we
develop a deep learning model for Arabic speaker identification using the
Wav2Vec2.0 and HuBERT audio representation learning tools. The end-to-end
Wav2Vec2.0 paradigm learns contextualized speech representations by randomly
masking a set of feature vectors and then applying a transformer neural
network. We employ an MLP classifier to differentiate between the labeled
speaker classes. We present several experimental results that demonstrate the
high accuracy of the proposed model: an arbitrary wave signal from a given
speaker can be identified with 98% and 97.1% accuracy in the cases of
Wav2Vec2.0 and HuBERT, respectively.
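The abstract outlines a two-stage pipeline: a pre-trained Wav2Vec2.0 (or HuBERT) encoder produces contextualized speech representations, and an MLP head classifies the speaker. Below is a minimal sketch of that pipeline, assuming the Hugging Face transformers checkpoint facebook/wav2vec2-base, mean pooling over time, and a single-hidden-layer MLP; the paper does not specify these details, so treat the pooling and layer sizes as illustrative.

```python
# Minimal sketch: Wav2Vec2.0 encoder + MLP speaker-identification head.
# Assumptions (not from the paper): the "facebook/wav2vec2-base" checkpoint,
# mean pooling over time, and a single hidden layer in the MLP.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeakerIdModel(nn.Module):
    def __init__(self, num_speakers: int, hidden: int = 256):
        super().__init__()
        # Pre-trained encoder mapping raw 16 kHz waveforms to contextualized
        # frame-level representations (768-dim for the base model).
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.mlp = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_speakers),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw audio at 16 kHz.
        frames = self.encoder(waveform).last_hidden_state  # (batch, time, dim)
        pooled = frames.mean(dim=1)                        # utterance-level embedding
        return self.mlp(pooled)                            # (batch, num_speakers) logits

# Usage sketch: classify a 2-second dummy clip among a hypothetical 30 reciters.
model = SpeakerIdModel(num_speakers=30)
logits = model(torch.randn(1, 32000))
print(logits.argmax(dim=-1))  # predicted speaker index
```

Swapping in HuBERT amounts to replacing the encoder with transformers.HubertModel.from_pretrained("facebook/hubert-base-ls960"); the MLP head is unchanged. In practice, the raw waveform would also be normalized with the matching feature extractor before encoding.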
Related papers
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification [0.0]
This paper introduces an effective end-to-end speaker identification model that applies a Transformer-based contextual model.
We propose Temporal Gate Pooling, a pooling method with strong learning ability for speaker identification.
The proposed method has achieved an accuracy of 87.1% with 28.5M parameters, demonstrating comparable precision to wav2vec2 with 317.7M parameters.
arXiv Detail & Related papers (2023-08-22T07:34:07Z)
- Speaker and Language Change Detection using Wav2vec2 and Whisper [1.9594639581421422]
We investigate transformer networks pre-trained for automatic speech recognition for their ability to detect speaker and language changes in speech.
We show that these capabilities are indeed present, with speaker recognition equal error rates on the order of 10% and language detection error rates of a few percent.
arXiv Detail & Related papers (2023-02-18T16:45:30Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals in the form of discrete labels produced by a random-projection quantizer (a minimal sketch of such a quantizer appears after this list).
It achieves word error rates similar to previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets (a rough sketch of this target switching appears after this list).
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset [0.0]
This paper introduces a deep learning emotion recognition model for Arabic speech dialogues.
The developed model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT.
The experimental results of our model surpass previously known outcomes.
arXiv Detail & Related papers (2021-10-09T00:58:12Z)
- Multi-task Voice-Activated Framework using Self-supervised Learning [0.9864260997723973]
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data.
We propose a general purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks.
arXiv Detail & Related papers (2021-10-03T19:28:57Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
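As noted above, here is a minimal sketch of a random-projection quantizer of the kind described in the self-supervised speech recognition entry: a frozen random projection maps feature frames into a code space, and the nearest entry of a frozen random codebook supplies the discrete label. The dimensions, codebook size, and cosine-distance lookup are assumptions for illustration, not taken from the listed abstract.

```python
# Minimal sketch of a random-projection quantizer that turns speech feature
# frames into discrete labels. The projection matrix and codebook are random
# and frozen; only the downstream model is trained to predict these labels.
# All dimensions and sizes here are illustrative assumptions.
import torch

class RandomProjectionQuantizer:
    def __init__(self, feat_dim: int = 80, code_dim: int = 16,
                 codebook_size: int = 8192, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        # Fixed random projection from feature space to code space.
        self.proj = torch.randn(feat_dim, code_dim, generator=g)
        # Fixed random codebook of unit-norm vectors.
        cb = torch.randn(codebook_size, code_dim, generator=g)
        self.codebook = cb / cb.norm(dim=-1, keepdim=True)

    def __call__(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (time, feat_dim) speech features, e.g. log-mel frames.
        z = feats @ self.proj                        # (time, code_dim)
        z = z / z.norm(dim=-1, keepdim=True)         # normalize before lookup
        # Nearest codebook entry (by cosine distance) is the discrete label.
        return (z @ self.codebook.T).argmax(dim=-1)  # (time,) integer labels

# Usage sketch: label 100 frames of dummy 80-dim features.
quantizer = RandomProjectionQuantizer()
labels = quantizer(torch.randn(100, 80))
print(labels.shape, labels.min().item(), labels.max().item())
```

Because neither the projection nor the codebook is learned, the labels serve purely as self-supervised prediction targets for the masked-frame model.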
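The wav2vec-Switch entry describes feeding an original/noisy speech pair through the network and switching their quantized representations as extra prediction targets, so each view must also predict the other's quantization. A rough sketch of that target-switching step follows, abstracting the wav2vec 2.0 internals behind a hypothetical encode helper; the pairing logic is the point, not the encoder details.

```python
# Rough sketch of wav2vec-Switch-style target switching (illustrative only;
# `encode` is a hypothetical helper returning (contextualized_reprs,
# quantized_targets) for a batch of raw audio).
import torch

def switched_contrastive_targets(clean_wave, noisy_wave, encode):
    c_clean, q_clean = encode(clean_wave)  # contextualized + quantized (clean)
    c_noisy, q_noisy = encode(noisy_wave)  # contextualized + quantized (noisy)
    # Standard contrastive targets: each view predicts its own quantization.
    standard = [(c_clean, q_clean), (c_noisy, q_noisy)]
    # Switched targets: each view must also predict the *other* view's
    # quantization, encouraging noise-invariant representations.
    switched = [(c_clean, q_noisy), (c_noisy, q_clean)]
    return standard + switched
```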
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.