SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech
 - URL: http://arxiv.org/abs/2403.00887v1
 - Date: Fri, 1 Mar 2024 11:28:37 GMT
 - Title: SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech
 - Authors: Aron R, Indra Sigicharla, Chirag Periwal, Mohanaprasad K, Nithya Darisini P S, Sourabh Tiwari, Shivani Arora
 - Abstract summary: This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications.
 The paper compares the single-output, multi-output, and sequential deep learning models highlighted for these predictions.
The experiments suggest that multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between the target variables and speech inputs, all while achieving improved runtime.
 - Score: 0.0
 - License: http://creativecommons.org/licenses/by-sa/4.0/
 - Abstract:   The interpretation of human voices holds importance across various
applications. This study ventures into predicting age, gender, and emotion from
vocal cues, a field with vast applications. Advances in voice analysis
technology span domains, from improving customer interactions to enhancing
healthcare and retail experiences. Discerning emotions aids mental health,
while age and gender detection are vital in various contexts. Exploring deep
learning models for these predictions involves comparing the single-output,
multi-output, and sequential models highlighted in this paper. Sourcing
suitable data posed challenges, resulting in the amalgamation of the CREMA-D
and EMO-DB datasets. Prior work showed promise in individual predictions, but
limited research has considered all three variables simultaneously. This paper
identifies flaws in the individual-model approach and advocates for our novel
multi-output learning architecture, the Speech-based Emotion, Gender and Age
Analysis (SEGAA) model. The experiments suggest that multi-output models
perform comparably to individual models, efficiently capturing the intricate
relationships between the target variables and speech inputs, all while
achieving improved runtime.
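
A brief illustration may help here. The abstract describes SEGAA only at a high
level, so below is a minimal, hedged sketch of the multi-output idea it
advocates: one shared speech encoder feeding separate heads for emotion,
gender, and age, trained with a single joint loss. The MFCC-style input, GRU
encoder, layer sizes, and label counts are illustrative assumptions, not the
authors' published configuration.

    # Minimal multi-output sketch (PyTorch): one shared encoder, three heads.
    # Feature choice, layer sizes, and label sets are assumptions, not the
    # paper's exact SEGAA configuration.
    import torch
    import torch.nn as nn

    class MultiOutputSpeechModel(nn.Module):
        def __init__(self, n_mfcc=40, hidden=128,
                     n_emotions=6, n_genders=2, n_age_groups=4):
            super().__init__()
            # Shared encoder over a sequence of MFCC frames.
            self.encoder = nn.GRU(input_size=n_mfcc, hidden_size=hidden,
                                  batch_first=True)
            # One prediction head per target variable.
            self.emotion_head = nn.Linear(hidden, n_emotions)
            self.gender_head = nn.Linear(hidden, n_genders)
            self.age_head = nn.Linear(hidden, n_age_groups)

        def forward(self, x):
            # x: (batch, frames, n_mfcc)
            _, h = self.encoder(x)   # final hidden state: (1, batch, hidden)
            h = h.squeeze(0)
            return self.emotion_head(h), self.gender_head(h), self.age_head(h)

    if __name__ == "__main__":
        model = MultiOutputSpeechModel()
        clips = torch.randn(8, 200, 40)   # 8 clips, 200 MFCC frames each
        emo, gen, age = model(clips)
        # A joint loss sums the three cross-entropy terms, so one
        # forward/backward pass updates all heads; this sharing is where the
        # runtime advantage over three single-output models comes from.
        loss = (nn.functional.cross_entropy(emo, torch.randint(0, 6, (8,)))
                + nn.functional.cross_entropy(gen, torch.randint(0, 2, (8,)))
                + nn.functional.cross_entropy(age, torch.randint(0, 4, (8,))))
        loss.backward()
        print(emo.shape, gen.shape, age.shape, float(loss))

By contrast, one plausible reading of the "sequential" variant mentioned in the
abstract would chain the heads, for example conditioning the emotion head on
the gender and age predictions; the abstract does not specify the exact
chaining, so that reading is likewise an assumption.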
 
       
      
        Related papers
        - UniConv: Unifying Retrieval and Response Generation for Large Language   Models in Conversations [71.79210031338464]
We show how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
arXiv  Detail & Related papers  (2025-07-09T17:02:40Z) - Personality Prediction from Life Stories using Language Models [12.851871085845499]
In this study, we address the challenge of modeling long narrative interviews, each exceeding 2000 tokens, in order to predict Five-Factor Model (FFM) personality traits. We propose a two-step approach: first, we extract contextual embeddings using sliding-window fine-tuning of pretrained language models; then, we apply Recurrent Neural Networks (RNNs) with attention mechanisms to integrate long-range dependencies and enhance interpretability.
arXiv  Detail & Related papers  (2025-06-24T02:39:06Z) - CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for   Fair Speech Emotion Recognition [49.27067541740956]
We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
arXiv  Detail & Related papers  (2025-06-06T13:25:56Z) - Experimenting with Affective Computing Models in Video Interviews with   Spanish-speaking Older Adults [2.4866182704905495]
This study evaluates state-of-the-art affective computing models using videos of older adults interacting with either a person or a virtual avatar.
As part of this effort, we introduce a novel dataset featuring Spanish-speaking older adults engaged in human-to-human video interviews.
arXiv  Detail & Related papers  (2025-01-28T11:42:15Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv  Detail & Related papers  (2024-07-17T08:13:22Z) - Cognitive Insights Across Languages: Enhancing Multimodal Interview   Analysis [0.6062751776009752]
We propose a multimodal model capable of predicting Mild Cognitive Impairment and cognitive scores.
The proposed model demonstrates the ability to transcribe and differentiate between languages used in the interviews.
Our approach involves in-depth research to implement various features obtained from the proposed modalities.
arXiv  Detail & Related papers  (2024-06-11T17:59:31Z) - A Multi-Task, Multi-Modal Approach for Predicting Categorical and
  Dimensional Emotions [0.0]
We propose a multi-task, multi-modal system that predicts categorical and dimensional emotions.
Results emphasise the importance of cross-regularisation between the two types of emotions.
arXiv  Detail & Related papers  (2023-12-31T16:48:03Z) - A Hierarchical Regression Chain Framework for Affective Vocal Burst
  Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv  Detail & Related papers  (2023-03-14T16:08:45Z) - Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A
  Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, meetings composed of 3-4 persons, and microphones and cameras, respectively.
arXiv  Detail & Related papers  (2022-07-20T13:37:57Z) - A Multibias-mitigated and Sentiment Knowledge Enriched Transformer for
  Debiasing in Multimodal Conversational Emotion Recognition [9.020664590692705]
Multimodal emotion recognition in conversations (mERC) is an active research topic in natural language processing (NLP).
Innumerable implicit prejudices and preconceptions fill human language and conversations.
Existing data-driven mERC approaches may offer higher emotional scores on utterances by females than males.
arXiv  Detail & Related papers  (2022-07-17T08:16:49Z) - BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for
  Conversational Gestures Synthesis [9.95713767110021]
The Body-Expression-Audio-Text (BEAT) dataset has 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages.
BEAT is the largest motion capture dataset for investigating human gestures.
arXiv  Detail & Related papers  (2022-03-10T11:19:52Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
  Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv  Detail & Related papers  (2022-03-03T20:52:47Z) - DALL-Eval: Probing the Reasoning Skills and Social Biases of
  Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv  Detail & Related papers  (2022-02-08T18:36:52Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
  Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv  Detail & Related papers  (2021-10-18T08:52:31Z) - Multitask Learning for Emotion and Personality Detection [17.029426018676997]
We build on the known correlation between personality traits and emotional behaviors, and propose a novel multitask learning framework, SoGMTL.
Our more computationally efficient CNN-based multitask model achieves state-of-the-art performance across multiple well-known personality and emotion datasets.
arXiv  Detail & Related papers  (2021-01-07T03:09:55Z) 
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     