emotion2vec: Self-Supervised Pre-Training for Speech Emotion
Representation
- URL: http://arxiv.org/abs/2312.15185v1
- Date: Sat, 23 Dec 2023 07:46:55 GMT
- Title: emotion2vec: Self-Supervised Pre-Training for Speech Emotion
Representation
- Authors: Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang
Zhang, Xie Chen
- Abstract summary: We propose emotion2vec, a universal speech emotion representation model.
emotion2vec is pre-trained on unlabeled emotion data through self-supervised online distillation.
It outperforms state-of-the-art pre-trained universal models and emotion specialist models.
- Score: 42.29118614670941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose emotion2vec, a universal speech emotion representation model.
emotion2vec is pre-trained on open-source unlabeled emotion data through
self-supervised online distillation, combining utterance-level loss and
frame-level loss during pre-training. emotion2vec outperforms state-of-the-art
pre-trained universal models and emotion specialist models by training only
linear layers for the speech emotion recognition task on the mainstream IEMOCAP
dataset. In addition, emotion2vec shows consistent improvements on speech
emotion recognition datasets in 10 different languages. emotion2vec also
shows excellent results on other emotion tasks, such as song emotion
recognition, emotion prediction in conversation, and sentiment analysis.
Comparison experiments, ablation experiments, and visualization comprehensively
demonstrate the universal capability of the proposed emotion2vec. To the best
of our knowledge, emotion2vec is the first universal representation model for
various emotion-related tasks, filling a gap in the field.
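The downstream evaluation described in the abstract (training only linear layers on top of the frozen emotion2vec representation) can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the feature dimension, the four-class label set, and all names below are illustrative assumptions.
```python
# Hypothetical linear-probe sketch: a frozen encoder (e.g. emotion2vec) provides
# utterance-level features, and only a linear classifier is trained for speech
# emotion recognition. Feature dimension and class count are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 768      # assumed embedding size of the frozen encoder
NUM_CLASSES = 4     # e.g. angry / happy / neutral / sad on IEMOCAP

class LinearProbe(nn.Module):
    """A single linear layer on top of frozen utterance-level features."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) utterance-level embeddings from the frozen model
        return self.classifier(feats)

def train_step(probe, optimizer, feats, labels):
    """One optimization step; the upstream encoder is never updated."""
    logits = probe(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    probe = LinearProbe(FEAT_DIM, NUM_CLASSES)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    # Placeholder batch standing in for pre-extracted emotion2vec features.
    feats = torch.randn(8, FEAT_DIM)
    labels = torch.randint(0, NUM_CLASSES, (8,))
    print(train_step(probe, optimizer, feats, labels))
```
In the paper's protocol the frozen representation would come from the pre-trained emotion2vec encoder; here random tensors merely stand in for pre-extracted utterance-level features.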
Related papers
- EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [34.03787613163788]
EmoSphere-TTS synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech.
We propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics.
arXiv Detail & Related papers (2024-06-12T01:40:29Z)
- Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition [12.605375307094416]
We propose an emotional text-to-speech design to simulate a wider spectrum of emotions grounded on the structural model.
Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn emotionally-separable prosody embedding as a proxy for emotion.
arXiv Detail & Related papers (2024-02-22T13:15:49Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech utterance according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Using Knowledge-Embedded Attention to Augment Pre-trained Language Models for Fine-Grained Emotion Recognition [0.0]
We focus on improving fine-grained emotion recognition by introducing external knowledge into a pre-trained self-attention model.
Results and error analyses show that our model outperforms previous models on several datasets.
arXiv Detail & Related papers (2021-07-31T09:41:44Z)
- A Circular-Structured Representation for Visual Emotion Distribution Learning [82.89776298753661]
We propose a well-grounded circular-structured representation that utilizes prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
arXiv Detail & Related papers (2021-06-23T14:53:27Z)
- Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network for Emotional Conversation Generation [25.808037796936766]
In a real-world conversation, we instinctively perceive emotions from multi-source information.
We propose a heterogeneous graph-based model for emotional conversation generation.
Experimental results show that our model can effectively perceive emotions from multi-source knowledge.
arXiv Detail & Related papers (2020-12-09T06:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.