Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model
- URL: http://arxiv.org/abs/2203.11774v1
- Date: Tue, 22 Mar 2022 14:39:56 GMT
- Title: Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model
- Authors: Tarun Gupta, Duc-Tuan Truong, Tran The Anh, Chng Eng Siong
- Abstract summary: We propose a bi-encoder transformer mixture model for speaker age and height estimation.
Considering the wide differences in male and female voice characteristics, we propose the use of two separate transformer encoders.
We significantly outperform the current state-of-the-art results on age estimation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The estimation of speaker characteristics such as age and height is a
challenging task, having numerous applications in voice forensic analysis. In
this work, we propose a bi-encoder transformer mixture model for speaker age
and height estimation. Considering the wide differences in male and female
voice characteristics such as differences in formant and fundamental
frequencies, we propose the use of two separate transformer encoders to extract
gender-specific voice features for male and female speakers, using wav2vec 2.0
as a shared feature extractor. This architecture reduces the
interference effects during backpropagation and improves the generalizability
of the model. We perform our experiments on the TIMIT dataset and significantly
outperform the current state-of-the-art results on age estimation.
Specifically, we achieve root mean squared error (RMSE) of 5.54 years and 6.49
years for male and female age estimation, respectively. A further experiment
evaluating the relative importance of different phonetic types for our task
demonstrates that vowel sounds are the most distinguishing for age estimation.
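The gating idea behind the bi-encoder mixture can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the transformer encoders are replaced by single ReLU layers with random weights, the gender classifier by a sigmoid over pooled features, and all dimensions (`D`, `H`) and names are assumptions. Frame-level features stand in for wav2vec 2.0 output; the gender posterior mixes the two branch predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 8, 4  # feature dim (wav2vec 2.0 would give 768) and branch width

# Stand-ins for the two gender-specific encoders and regression heads.
W_male = rng.normal(size=(D, H))
W_female = rng.normal(size=(D, H))
head_male = rng.normal(size=H)
head_female = rng.normal(size=H)
w_gender = rng.normal(size=D)  # stand-in gender classifier weights

def predict_age(frames):
    """frames: (T, D) frame-level features from the shared extractor."""
    pooled = frames.mean(axis=0)                 # utterance-level pooling
    p_male = 1.0 / (1.0 + np.exp(-(w_gender @ pooled)))  # gender posterior
    # Each branch: ReLU projection, pooling, scalar regression head.
    age_m = np.maximum(frames @ W_male, 0).mean(axis=0) @ head_male
    age_f = np.maximum(frames @ W_female, 0).mean(axis=0) @ head_female
    # Mixture: gender posterior gates the two branch predictions.
    return p_male * age_m + (1.0 - p_male) * age_f

frames = rng.normal(size=(50, D))
print(predict_age(frames))
```

Because the gate is a soft posterior rather than a hard decision, gradients flow into both branches during training, weighted by gender probability, which is what limits cross-gender interference during backpropagation.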
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC), conditioned on VR-LH features, are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants [10.227469020901232]
This paper introduces the Sonos Voice Control Bias Assessment dataset.
The dataset comprises 1,038 speakers, 166 hours of audio, and 170k audio samples with 9,040 unique labelled transcripts.
Results show statistically significant differences in performance across age, dialectal region and ethnicity.
arXiv Detail & Related papers (2024-05-14T12:53:32Z)
- Evolution of Voices in French Audiovisual Media Across Genders and Age in a Diachronic Perspective [0.9449650062296824]
We present a diachronic acoustic analysis of the voice of 1023 speakers from French media archives.
Speakers are spread across 32 categories based on four periods (years 1955/56, 1975/76, 1995/96, 2015/16), four age groups (20-35; 36-50; 51-65, >65), and two genders.
arXiv Detail & Related papers (2024-04-24T18:00:06Z)
- SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech [0.0]
This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications.
This paper compares single-output, multi-output, and sequential deep learning models for these predictions.
The experiments suggest that multi-output models perform comparably to individual models, capturing the intricate relationships between variables and speech inputs while achieving improved runtime.
arXiv Detail & Related papers (2024-03-01T11:28:37Z) - DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio
Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z) - Anomalous Sound Detection using Audio Representation with Machine ID
based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- VoxCeleb Enrichment for Age and Gender Recognition [12.520037579004883]
We provide speaker age labels and (an alternative) annotation of speaker gender in VoxCeleb datasets.
We demonstrate the use of this metadata by constructing age and gender recognition models.
We also compare the original VoxCeleb gender labels with our labels to identify records that might be mislabeled in the original VoxCeleb data.
arXiv Detail & Related papers (2021-09-28T06:18:57Z)
- End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN [24.46321998619126]
We propose a novel approach of using attention mechanism to build an end-to-end architecture for height and age estimation.
The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder that captures long-term dependencies in the input acoustic features.
arXiv Detail & Related papers (2021-01-13T13:41:18Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representation extraction or fine-tuned with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.