End-to-End Speaker Height and age estimation using Attention Mechanism
with LSTM-RNN
- URL: http://arxiv.org/abs/2101.05056v1
- Date: Wed, 13 Jan 2021 13:41:18 GMT
- Title: End-to-End Speaker Height and age estimation using Attention Mechanism
with LSTM-RNN
- Authors: Manav Kaushik, Van Tung Pham, Eng Siong Chng
- Abstract summary: We propose a novel approach of using an attention mechanism to build an end-to-end architecture for height and age estimation.
The attention mechanism is combined with a Long Short-Term Memory (LSTM) encoder, which captures long-term dependencies in the input acoustic features.
- Score: 24.46321998619126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic height and age estimation of speakers using acoustic features is
widely used for the purpose of human-computer interaction, forensics, etc. In
this work, we propose a novel approach of using attention mechanism to build an
end-to-end architecture for height and age estimation. The attention mechanism
is combined with a Long Short-Term Memory (LSTM) encoder, which captures
long-term dependencies in the input acoustic features. We modify the
conventional attention mechanism -- which computes the context vector as a
weighted sum across timeframes only -- by introducing a modified context vector
that also accounts for attention across encoder units, giving us a new
cross-attention mechanism. Apart from this, we also investigate a
multi-task learning approach for jointly estimating speaker height and age. We
train and test our model on the TIMIT corpus. Our model outperforms several
approaches in the literature. We achieve a root mean square error (RMSE) of
6.92 cm and 6.34 cm for male and female heights respectively, and an RMSE of
7.85 years and 8.75 years for male and female ages respectively. By tracking
the attention weights allocated to different phones, we find that vowel phones
are the most important, while stop phones are the least important, for the
estimation task.
Related papers
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention heads, which we named Positional Heads, from various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension [21.729875191721984]
We introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention.
We also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions.
Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length.
arXiv Detail & Related papers (2024-10-05T15:59:32Z)
- HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech [42.688549469089985]
We construct a novel framework, namely Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for Alzheimer's Disease detection.
Specifically, we employ an attention-free module of Multi-Scale Depthwise Convolution to replace the self-attention and thus avoid the expensive computation.
By conducting extensive experiments on the ADReSS-M dataset, the introduced HAFFormer can achieve competitive results (82.6% accuracy) with other recent work.
arXiv Detail & Related papers (2024-05-07T02:19:16Z)
- LoCoNet: Long-Short Context Network for Active Speaker Detection [18.06037779826666]
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video.
We propose LoCoNet, a simple yet effective Long-Short Context Network.
LoCoNet achieves state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2023-01-19T18:54:43Z)
- Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model [3.1447111126464997]
We propose a bi-encoder transformer mixture model for speaker age and height estimation.
Considering the wide differences in male and female voice characteristics, we propose the use of two separate transformer encoders.
We significantly outperform the current state-of-the-art results on age estimation.
arXiv Detail & Related papers (2022-03-22T14:39:56Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Temporal Memory Relation Network for Workflow Recognition from Surgical Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both VoxCeleb1 & 2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
- Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs [65.28795726837386]
We introduce a meta-learning framework for imbalance length pairs.
We train it with a support set of long utterances and a query set of short utterances of varying lengths.
By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models.
arXiv Detail & Related papers (2020-04-06T17:53:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.