A Lightweight Speaker Recognition System Using Timbre Properties
- URL: http://arxiv.org/abs/2010.05502v2
- Date: Tue, 13 Oct 2020 05:58:43 GMT
- Title: A Lightweight Speaker Recognition System Using Timbre Properties
- Authors: Abu Quwsar Ohi, M. F. Mridha, Md. Abdul Hamid, Muhammad Mostafa
Monowar, Dongsu Lee, Jinsul Kim
- Abstract summary: We propose a lightweight text-independent speaker recognition model based on a random forest classifier.
It also introduces new features that are used for both speaker verification and identification tasks.
The prototype uses the seven most actively searched timbre properties: boominess, brightness, depth, hardness, roughness, sharpness, and warmth.
- Score: 0.5708902722746041
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speaker recognition is an active research area with notable applications in
biometric security and authentication systems. Currently, there exist many
well-performing models in the speaker recognition domain. However, most of the
advanced models implement deep learning that requires GPU support for real-time
speech recognition, and it is not suitable for low-end devices. In this paper,
we propose a lightweight text-independent speaker recognition model based on a
random forest classifier. The model also introduces new features that are used for
both speaker verification and identification tasks. The proposed model uses
human speech-based timbral properties as features that are classified using a
random forest. Timbre refers to the very basic properties of sound that allow
listeners to discriminate among them. The prototype uses the seven most actively
searched timbre properties: boominess, brightness, depth, hardness, roughness,
sharpness, and warmth as features of our speaker recognition model. The
experiment is carried out on speaker verification and speaker identification
tasks and shows the achievements and drawbacks of the proposed model. In the
speaker identification phase, it achieves a maximum accuracy of 78%. In
contrast, in the speaker verification phase, the model maintains an accuracy of
80% with an equal error rate (EER) of 0.24.
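The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: real timbral feature extraction (producing the seven properties per utterance) is out of scope, so synthetic 7-dimensional feature vectors stand in for it, and scikit-learn's random forest is assumed as the classifier. The verification step shows how an EER like the reported 0.24 is computed from match scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
n_speakers, per_speaker = 5, 40

# Hypothetical 7-dimensional timbral feature vectors (boominess, brightness,
# depth, hardness, roughness, sharpness, warmth), one row per utterance.
X = rng.normal(size=(n_speakers * per_speaker, 7))
y = np.repeat(np.arange(n_speakers), per_speaker)
X += y[:, None] * 0.8  # shift per speaker so the synthetic classes separate

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Speaker identification: predict which enrolled speaker produced each utterance.
acc = (clf.predict(X) == y).mean()

# Speaker verification: use the predicted probability for a claimed speaker
# (here speaker 0) as a match score, then find the equal error rate (EER),
# i.e. the operating point where false accepts equal false rejects.
scores = clf.predict_proba(X)[:, 0]
labels = (y == 0).astype(int)  # 1 = genuine claim, 0 = impostor
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.argmin(np.abs(fnr - fpr))]
print(f"identification accuracy: {acc:.2f}, verification EER: {eer:.2f}")
```

Because the features here are synthetic and evaluated on the training data, the numbers it prints are optimistic; the sketch is only meant to show the structure of the two tasks.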
Related papers
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z) - Developing Acoustic Models for Automatic Speech Recognition in Swedish [6.5458610824731664]
This paper is concerned with automatic continuous speech recognition using trainable systems.
The aim of this work is to build acoustic models for spoken Swedish.
arXiv Detail & Related papers (2024-04-25T12:03:14Z) - Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - An Effective Transformer-based Contextual Model and Temporal Gate
Pooling for Speaker Identification [0.0]
This paper introduces an effective end-to-end speaker identification model that applies a Transformer-based contextual model.
We propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification.
The proposed method has achieved an accuracy of 87.1% with 28.5M parameters, demonstrating comparable precision to wav2vec2 with 317.7M parameters.
arXiv Detail & Related papers (2023-08-22T07:34:07Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Retrieving Speaker Information from Personalized Acoustic Models for
Speech Recognition [5.1229352884025845]
We show that it is possible to retrieve not only the gender of the speaker but also their identity, by exploiting just the weight matrix changes of a neural acoustic model locally adapted to this speaker.
arXiv Detail & Related papers (2021-11-07T22:17:52Z) - Improving on-device speaker verification using federated learning with
privacy [5.321241042620525]
Information on speaker characteristics can be useful as side information in improving speaker recognition accuracy.
This paper investigates how privacy-preserving learning can improve a speaker verification system.
arXiv Detail & Related papers (2020-08-06T13:37:14Z) - Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z) - Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio
Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.