State-of-the-art in speaker recognition
- URL: http://arxiv.org/abs/2202.12705v1
- Date: Wed, 23 Feb 2022 11:49:09 GMT
- Title: State-of-the-art in speaker recognition
- Authors: Marcos Faundez-Zanuy, Enric Monte-Moreno
- Abstract summary: Recent advances in speech technologies have produced new tools to improve speaker recognition.
Speaker recognition is far from being a technology where all the possibilities have already been explored.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in speech technologies have produced new tools that can be used to improve the performance and flexibility of speaker recognition. While there are few degrees of freedom or alternative methods when using fingerprint or iris identification techniques, speech offers much more flexibility and different levels for performing recognition: the system can force the user to speak in a particular manner, different for each attempt to enter. With voice input the system also has other degrees of freedom, such as the use of knowledge/codes that only the user knows, or dialectal/semantic traits that are difficult to forge. This paper offers an overview of the state of the art in speaker recognition, with special emphasis on the pros and cons and on current research lines, which include improved classification systems and the use of high-level information by means of probabilistic grammars. In conclusion, speaker recognition is far from being a technology where all the possibilities have already been explored.
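To make the text-prompted setting concrete, below is a minimal sketch of a prompted speaker-verification decision: the system issues a fresh phrase on every attempt (so a replayed recording of a previous attempt is useless), embeds the spoken response, and accepts the identity claim only if the similarity to the enrolled voice clears a threshold. The prompt vocabulary, the spectral stand-in for a speaker embedding, and the 0.7 threshold are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch: text-prompted speaker verification (assumptions noted).
import secrets

import numpy as np

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]  # assumed prompt vocabulary


def random_prompt(n_words: int = 4) -> str:
    # A fresh phrase per attempt is what defeats simple replay attacks.
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))


def embed(waveform: np.ndarray, dim: int = 64) -> np.ndarray:
    # Crude stand-in for a trained speaker embedding: mean log-magnitude
    # spectrum over 512-sample frames, truncated and length-normalized.
    frames = np.lib.stride_tricks.sliding_window_view(waveform, 512)[::256]
    feat = np.log1p(np.abs(np.fft.rfft(frames, axis=1))).mean(axis=0)[:dim]
    return feat / (np.linalg.norm(feat) + 1e-9)


def verify(attempt: np.ndarray, enrolled: np.ndarray,
           threshold: float = 0.7) -> bool:
    # Both inputs are unit vectors, so the dot product is cosine similarity.
    return float(np.dot(attempt, enrolled)) >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enrolled = embed(rng.standard_normal(16000))  # stand-in enrollment audio
    attempt = embed(rng.standard_normal(16000))   # stand-in test attempt
    print(random_prompt(), "->", verify(attempt, enrolled))
```

A real deployment would additionally run speech recognition on the response to check that the spoken words match the issued prompt; that check is what ties the biometric decision to the per-attempt phrase and to the knowledge/codes mentioned above.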
Related papers
- "It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services [3.8931913630405393]
This study evaluates two synthetic AI voice services (Speechify and ElevenLabs) through a mixed methods approach.
Our findings reveal technical performance disparities across five regional, English-language accents.
Current speech generation technologies may inadvertently reinforce linguistic privilege and accent-based discrimination.
arXiv Detail & Related papers (2025-04-12T21:31:22Z)
- Controlling Emotion in Text-to-Speech with Natural Language Prompts [29.013577423045255]
We propose a system conditioned on embeddings derived from an emotionally rich text that serves as a prompt.
A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture.
Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model.
arXiv Detail & Related papers (2024-06-10T15:58:42Z)
- The evaluation of a code-switched Sepedi-English automatic speech recognition system [0.0]
We present the evaluation of the Sepedi-English code-switched automatic speech recognition system.
This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach.
The model achieved its lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.
arXiv Detail & Related papers (2024-03-11T15:11:28Z)
- Deep Learning-based Spatio Temporal Facial Feature Visual Speech Recognition [0.0]
We present an alternate authentication process that makes use of both facial recognition and the individual's distinctive temporal facial feature motions while they speak a password.
The suggested model attained an accuracy of 96.1% when tested on the industry-standard MIRACL-VC1 dataset.
arXiv Detail & Related papers (2023-04-30T18:52:29Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER and the average last-token emission latency by more than 6% relative and 40 ms, respectively.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z)
- A Review of Speaker Diarization: Recent Advances with Deep Learning [78.20151731627958]
Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity, i.e. determining "who spoke when" (a minimal clustering sketch follows this list).
With the rise of deep learning technology, more rapid advancements have been made for speaker diarization.
We discuss how speaker diarization systems have been integrated with speech recognition applications.
arXiv Detail & Related papers (2021-01-24T01:28:05Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR).
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while requiring only small amounts of speech signal.
arXiv Detail & Related papers (2020-08-07T12:44:08Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
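As noted in the speaker diarization entry above, below is a minimal sketch of the clustering step at the core of many diarization pipelines: segment-level speaker embeddings are grouped by agglomerative clustering on cosine distance. The toy embeddings and the distance threshold are assumptions for illustration; real pipelines add voice activity detection and trained embedding extractors.

```python
# Illustrative sketch: clustering-based speaker diarization (assumptions noted).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def diarize(segment_embeddings: np.ndarray,
            distance_threshold: float = 0.5) -> np.ndarray:
    # Average-linkage agglomerative clustering on cosine distance;
    # returns one speaker label per segment ("who spoke when").
    z = linkage(segment_embeddings, method="average", metric="cosine")
    return fcluster(z, t=distance_threshold, criterion="distance")


# Toy example: three segments from one speaker, two from another.
emb = np.array([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0],
                [0.1, 1.0], [0.0, 0.9]])
print(diarize(emb))  # two clusters, e.g. [1 1 1 2 2]
```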