End-to-End Speaker Height and age estimation using Attention Mechanism
with LSTM-RNN
- URL: http://arxiv.org/abs/2101.05056v1
- Date: Wed, 13 Jan 2021 13:41:18 GMT
- Title: End-to-End Speaker Height and age estimation using Attention Mechanism
with LSTM-RNN
- Authors: Manav Kaushik, Van Tung Pham, Eng Siong Chng
- Abstract summary: We propose a novel approach of using an attention mechanism to build an end-to-end architecture for height and age estimation.
The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder, which captures long-term dependencies in the input acoustic features.
- Score: 24.46321998619126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic height and age estimation of speakers using acoustic features is
widely used for the purpose of human-computer interaction, forensics, etc. In
this work, we propose a novel approach of using attention mechanism to build an
end-to-end architecture for height and age estimation. The attention mechanism
is combined with a Long Short-Term Memory (LSTM) encoder, which captures
long-term dependencies in the input acoustic features. We modify the
conventional attention mechanism -- which computes the context vector as a
weighted sum across timeframes only -- by introducing a modified context vector
that also accounts for attention across encoder units, giving us a new
cross-attention mechanism. Apart from this, we also investigate a
multi-task learning approach for jointly estimating speaker height and age. We
train and test our model on the TIMIT corpus. Our model outperforms several
approaches in the literature. We achieve a root mean square error (RMSE) of
6.92 cm and 6.34 cm for male and female heights respectively, and an RMSE of
7.85 years and 8.75 years for male and female ages respectively. By tracking
the attention weights allocated to different phones, we find that vowel phones
are the most important, while stop phones are the least important, for the
estimation task.
Related papers
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention heads, which we named Positional Heads, from various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension [21.729875191721984]
We introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention.
We also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions.
Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length.
arXiv Detail & Related papers (2024-10-05T15:59:32Z)
- HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech [42.688549469089985]
We construct a novel framework, namely Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for Alzheimer's Disease detection.
Specifically, we employ an attention-free module of Multi-Scale Depthwise Convolution to replace the self-attention and thus avoid the expensive computation.
By conducting extensive experiments on the ADReSS-M dataset, the introduced HAFFormer can achieve competitive results (82.6% accuracy) with other recent work.
arXiv Detail & Related papers (2024-05-07T02:19:16Z)
- LoCoNet: Long-Short Context Network for Active Speaker Detection [18.06037779826666]
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video.
We propose LoCoNet, a simple yet effective Long-Short Context Network.
LoCoNet achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2023-01-19T18:54:43Z)
- Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model [3.1447111126464997]
We propose a bi-encoder transformer mixture model for speaker age and height estimation.
Considering the wide differences in male and female voice characteristics, we propose the use of two separate transformer encoders.
We significantly outperform the current state-of-the-art results on age estimation.
arXiv Detail & Related papers (2022-03-22T14:39:56Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
- Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs [65.28795726837386]
We introduce a meta-learning framework for imbalance length pairs.
We train it with a support set of long utterances and a query set of short utterances of varying lengths.
By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models.
arXiv Detail & Related papers (2020-04-06T17:53:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.