Fine-tuning Wav2vec for Vocal-burst Emotion Recognition
- URL: http://arxiv.org/abs/2210.00263v1
- Date: Sat, 1 Oct 2022 12:03:27 GMT
- Title: Fine-tuning Wav2vec for Vocal-burst Emotion Recognition
- Authors: Dang-Khanh Nguyen, Sudarshan Pant, Ngoc-Huynh Ho, Guee-Sang Lee,
Soo-Hyung Kim, Hyung-Jeong Yang
- Abstract summary: The ACII Vocal Affective Bursts (A-VB) competition introduces a new topic in affective computing.
Vocal bursts such as laughs, cries, and sighs remain largely unexploited, even though they are very informative for behavior analysis.
This technical report describes the method and the result of SclabCNU Team for the tasks of the challenge.
- Score: 7.910908058662372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ACII Affective Vocal Bursts (A-VB) competition introduces a new topic in
affective computing, which is understanding emotional expression using the
non-verbal sounds of humans. We are familiar with emotion recognition from verbal
speech or facial expressions. However, vocal bursts such as laughs, cries,
and sighs remain largely unexploited, even though they are very informative for behavior
analysis. The A-VB competition comprises four tasks that explore non-verbal
information in different spaces. This technical report describes the method and
the result of SclabCNU Team for the tasks of the challenge. We achieved
promising results compared to the baseline model provided by the organizers.
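As the title suggests, the approach centers on fine-tuning a pre-trained wav2vec model on the A-VB vocal-burst audio. A minimal sketch of that general recipe is given below, using the A-VB High task (per-burst intensities of ten emotions) as an example; the checkpoint, pooling, head size, and loss are illustrative assumptions rather than the authors' exact configuration.

```python
# Illustrative sketch (not the authors' exact setup): a pre-trained wav2vec 2.0
# encoder with a small regression head for per-burst emotion intensities.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class Wav2VecEmotionRegressor(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base", num_outputs=10):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Linear(256, num_outputs),
            nn.Sigmoid(),  # A-VB intensity labels are assumed to lie in [0, 1]
        )

    def forward(self, waveform):
        # waveform: (batch, samples) of raw 16 kHz audio
        states = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = states.mean(dim=1)                         # simple mean pooling over time
        return self.head(pooled)

# One dummy fine-tuning step (MSE shown for brevity)
model = Wav2VecEmotionRegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
audio = torch.randn(2, 16000)    # two 1-second dummy clips
targets = torch.rand(2, 10)      # dummy intensity labels
loss = nn.functional.mse_loss(model(audio), targets)
loss.backward()
optimizer.step()
```

In practice the encoder can be frozen or only partially unfrozen, and a concordance correlation coefficient (CCC) objective, as used for the challenge metric, would typically replace the MSE loss shown here.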
Related papers
- Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate an utterance according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - emotion2vec: Self-Supervised Pre-Training for Speech Emotion
Representation [42.29118614670941]
We propose emotion2vec, a universal speech emotion representation model.
emotion2vec is pre-trained on unlabeled emotion data through self-supervised online distillation.
It outperforms state-of-the-art pre-trained universal models and emotion specialist models.
arXiv Detail & Related papers (2023-12-23T07:46:55Z) - Prompting Audios Using Acoustic Properties For Emotion Representation [36.275219004598874]
We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts.
Our results show that the acoustic prompts significantly improve the model's performance in various Precision@K metrics.
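A rough illustration of this idea follows; the librosa feature choices, thresholds, and prompt wording are assumptions for demonstration, not the paper's recipe.

```python
# Illustrative sketch: derive a short natural-language prompt from simple
# acoustic properties (pitch and intensity) of an utterance.
import librosa
import numpy as np

def acoustic_prompt(path):
    y, sr = librosa.load(path, sr=16000)
    # Fundamental frequency (pitch) via pYIN; unvoiced frames are NaN
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=500, sr=sr)
    pitch = np.nanmean(f0)
    # Intensity approximated by mean RMS energy
    rms = librosa.feature.rms(y=y).mean()

    # Thresholds below are arbitrary placeholders for illustration
    pitch_word = "high-pitched" if not np.isnan(pitch) and pitch > 200 else "low-pitched"
    energy_word = "loud" if rms > 0.05 else "soft"
    return f"A {pitch_word}, {energy_word} voice expressing emotion."

# e.g. acoustic_prompt("sample.wav") -> "A high-pitched, soft voice expressing emotion."
```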
arXiv Detail & Related papers (2023-10-03T13:06:58Z) - ABAW: Valence-Arousal Estimation, Expression Recognition, Action Unit
Detection & Emotional Reaction Intensity Estimation Challenges [62.413819189049946]
5th Affective Behavior Analysis in-the-wild (ABAW) Competition is part of the respective ABAW Workshop which will be held in conjunction with IEEE Computer Vision and Pattern Recognition Conference (CVPR), 2023.
For this year's Competition, we feature two corpora: i) an extended version of the Aff-Wild2 database and ii) the Hume-Reaction dataset.
The latter dataset is an audiovisual one in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities.
arXiv Detail & Related papers (2023-03-02T18:58:15Z) - Describing emotions with acoustic property prompts for speech emotion
recognition [30.990720176317463]
We devise a method to automatically create a description for a given audio by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate.
We train a neural network model on these audio-text pairs and evaluate it on an additional dataset.
We investigate how the model can learn to associate the audio with the descriptions, resulting in performance improvement of Speech Emotion Recognition and Speech Audio Retrieval.
arXiv Detail & Related papers (2022-11-14T20:29:37Z) - Self-Supervised Attention Networks and Uncertainty Loss Weighting for
Multi-Task Emotion Recognition on Vocal Bursts [5.3802825558183835]
We present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB).
Our approach surpasses the challenge baseline by a wide margin on all four tasks.
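The uncertainty loss weighting mentioned in the title plausibly follows the common homoscedastic-uncertainty formulation, in which each task's loss is scaled by a learned precision; a minimal sketch of that formulation (not necessarily the authors' exact variant) is shown below.

```python
# Sketch of homoscedastic uncertainty weighting for multi-task training
# (one common formulation; the challenge paper may differ in details).
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, num_tasks):
        super().__init__()
        # log(sigma^2) per task, learned jointly with the network
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])   # 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage: combine per-task losses, e.g. from the four A-VB tasks
weighting = UncertaintyWeightedLoss(num_tasks=4)
losses = [torch.tensor(0.8), torch.tensor(0.3), torch.tensor(1.2), torch.tensor(0.5)]
combined = weighting(losses)
```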
arXiv Detail & Related papers (2022-09-15T15:50:27Z) - Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z) - AHD ConvNet for Speech Emotion Classification [0.0]
We propose a novel mel-spectrogram learning approach in which our model learns emotions from the waveform voice notes in the popular CREMA-D dataset.
It requires less training time than other approaches to speech emotion recognition.
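As an illustrative sketch only, with layer sizes, pooling, and the mel configuration as assumptions rather than the paper's architecture, a small ConvNet over log-mel spectrograms for the six CREMA-D emotion classes might look like the following.

```python
# Illustrative sketch: classify emotions from log-mel spectrograms with a small CNN.
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

class MelCNN(nn.Module):
    def __init__(self, num_classes=6):   # CREMA-D has six emotion classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, waveform):
        # waveform: (batch, samples) -> (batch, 1, n_mels, frames)
        spec = to_db(mel(waveform)).unsqueeze(1)
        return self.classifier(self.features(spec).flatten(1))

logits = MelCNN()(torch.randn(4, 16000))   # dummy batch of 1-second clips
```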
arXiv Detail & Related papers (2022-06-10T11:57:28Z) - Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Reinforcement Learning for Emotional Text-to-Speech Synthesis with
Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)