SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in
Speech
- URL: http://arxiv.org/abs/2403.00887v1
- Date: Fri, 1 Mar 2024 11:28:37 GMT
- Title: SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in
Speech
- Authors: Aron R, Indra Sigicharla, Chirag Periwal, Mohanaprasad K, Nithya
Darisini P S, Sourabh Tiwari, Shivani Arora
- Abstract summary: This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications.
Exploring deep learning models for these predictions involves comparing the single-output, multi-output, and sequential models highlighted in this paper.
The experiments suggest that multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between the variables and the speech inputs, all while achieving improved runtime.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The interpretation of human voices holds importance across various
applications. This study ventures into predicting age, gender, and emotion from
vocal cues, a field with broad practical relevance. Advances in voice analysis
technology span domains, from improving customer interactions to enhancing
healthcare and retail experiences. Discerning emotions aids mental health
support, while age and gender detection are vital in many contexts. Exploring
deep learning models for these predictions involves comparing the single-output,
multi-output, and sequential models highlighted in this paper. Sourcing suitable
data posed challenges, leading to the amalgamation of the CREMA-D and EMO-DB
datasets. Prior work showed promise for individual predictions, but limited
research has considered all three variables simultaneously. This paper
identifies flaws in the individual-model approach and advocates our novel
multi-output learning architecture, the Speech-based Emotion, Gender and Age
Analysis (SEGAA) model. The experiments suggest that multi-output models perform
comparably to individual models, efficiently capturing the intricate
relationships between the variables and the speech inputs, all while achieving
improved runtime.
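To make the multi-output idea concrete, below is a minimal sketch in PyTorch of a shared speech encoder feeding three task-specific heads for emotion, gender, and age. The feature dimensionality, layer sizes, class counts, and the treatment of age as binned classification are illustrative assumptions for this sketch, not the SEGAA authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiOutputSpeechModel(nn.Module):
    """Illustrative multi-output model: one shared encoder over per-frame
    acoustic features (e.g., MFCCs) and three task-specific heads.
    Sizes and class counts are assumptions, not the paper's exact design."""

    def __init__(self, n_features=40, n_emotions=6, n_genders=2, n_age_bins=4):
        super().__init__()
        # Shared encoder: two 1-D convolutions, then average-pool over time.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size embedding
            nn.Flatten(),
        )
        # One lightweight head per target variable.
        self.emotion_head = nn.Linear(128, n_emotions)
        self.gender_head = nn.Linear(128, n_genders)
        self.age_head = nn.Linear(128, n_age_bins)

    def forward(self, x):
        # x: (batch, n_features, time)
        z = self.encoder(x)
        return self.emotion_head(z), self.gender_head(z), self.age_head(z)


model = MultiOutputSpeechModel()
features = torch.randn(8, 40, 200)  # batch of 8 utterances, 200 frames each
emotion_logits, gender_logits, age_logits = model(features)

# Joint training sums the per-task losses, so one backward pass
# updates the shared encoder for all three predictions.
criterion = nn.CrossEntropyLoss()
targets = (torch.randint(0, 6, (8,)), torch.randint(0, 2, (8,)), torch.randint(0, 4, (8,)))
loss = (criterion(emotion_logits, targets[0])
        + criterion(gender_logits, targets[1])
        + criterion(age_logits, targets[2]))
loss.backward()
```

Under this sketch, summing the three losses lets a single backward pass update the shared encoder for all targets; the single-model approach described in the abstract would presumably train one such network per variable, while a sequential setup would chain the predictions from one model into the next.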
Related papers
- PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
- Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis [0.6062751776009752]
We propose a multimodal model capable of predicting Mild Cognitive Impairment and cognitive scores.
The proposed model demonstrates the ability to transcribe and differentiate between languages used in the interviews.
Our approach involves in-depth research to implement various features obtained from the proposed modalities.
arXiv Detail & Related papers (2024-06-11T17:59:31Z)
- A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions [0.0]
We propose a multi-task, multi-modal system that predicts categorical and dimensional emotions.
Results emphasise the importance of cross-regularisation between the two types of emotions.
arXiv Detail & Related papers (2023-12-31T16:48:03Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- A Multibias-mitigated and Sentiment Knowledge Enriched Transformer for Debiasing in Multimodal Conversational Emotion Recognition [9.020664590692705]
Multimodal emotion recognition in conversations (mERC) is an active research topic in natural language processing (NLP).
Innumerable implicit prejudices and preconceptions fill human language and conversations.
Existing data-driven mERC approaches may assign higher emotional scores to utterances by female speakers than to those by male speakers.
arXiv Detail & Related papers (2022-07-17T08:16:49Z)
- BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis [9.95713767110021]
The Body-Expression-Audio-Text (BEAT) dataset contains 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages.
BEAT is the largest motion capture dataset for investigating human gestures.
arXiv Detail & Related papers (2022-03-10T11:19:52Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z) - DALL-Eval: Probing the Reasoning Skills and Social Biases of
Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Multitask Learning for Emotion and Personality Detection [17.029426018676997]
We build on the known correlation between personality traits and emotional behaviors, and propose a novel multitask learning framework, SoGMTL.
Our more computationally efficient CNN-based multitask model achieves state-of-the-art performance across multiple well-known personality and emotion datasets.
arXiv Detail & Related papers (2021-01-07T03:09:55Z)