emotion2vec: Self-Supervised Pre-Training for Speech Emotion
Representation
- URL: http://arxiv.org/abs/2312.15185v1
- Date: Sat, 23 Dec 2023 07:46:55 GMT
- Title: emotion2vec: Self-Supervised Pre-Training for Speech Emotion
Representation
- Authors: Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang
Zhang, Xie Chen
- Abstract summary: We propose emotion2vec, a universal speech emotion representation model.
emotion2vec is pre-trained on unlabeled emotion data through self-supervised online distillation.
It outperforms state-of-the-art pre-trained universal models and emotion specialist models.
- Score: 42.29118614670941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose emotion2vec, a universal speech emotion representation model.
emotion2vec is pre-trained on open-source unlabeled emotion data through
self-supervised online distillation, combining utterance-level loss and
frame-level loss during pre-training. emotion2vec outperforms state-of-the-art
pre-trained universal models and emotion specialist models by training only
linear layers for the speech emotion recognition task on the mainstream IEMOCAP
dataset. In addition, emotion2vec shows consistent improvements on speech
emotion recognition datasets in 10 different languages. emotion2vec also
shows excellent results on other emotion tasks, such as song emotion
recognition, emotion prediction in conversation, and sentiment analysis.
Comparison experiments, ablation experiments, and visualization comprehensively
demonstrate the universal capability of the proposed emotion2vec. To the best
of our knowledge, emotion2vec is the first universal representation model for
various emotion-related tasks, filling a gap in the field.
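The downstream evaluation described in the abstract (training only linear layers on top of the frozen emotion2vec representation) can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the feature dimension, the four-class label set, and all names below are illustrative assumptions.
```python
# Hypothetical linear-probe sketch: a frozen encoder (e.g. emotion2vec) provides
# utterance-level features, and only a linear classifier is trained for speech
# emotion recognition. Feature dimension and class count are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 768      # assumed embedding size of the frozen encoder
NUM_CLASSES = 4     # e.g. angry / happy / neutral / sad on IEMOCAP

class LinearProbe(nn.Module):
    """A single linear layer on top of frozen utterance-level features."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) utterance-level embeddings from the frozen model
        return self.classifier(feats)

def train_step(probe, optimizer, feats, labels):
    """One optimization step; the upstream encoder is never updated."""
    logits = probe(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    probe = LinearProbe(FEAT_DIM, NUM_CLASSES)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    # Placeholder batch standing in for pre-extracted emotion2vec features.
    feats = torch.randn(8, FEAT_DIM)
    labels = torch.randint(0, NUM_CLASSES, (8,))
    print(train_step(probe, optimizer, feats, labels))
```
In the paper's protocol the frozen representation would come from the pre-trained emotion2vec encoder; here random tensors merely stand in for pre-extracted utterance-level features.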
Related papers
- EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech [34.03787613163788]
EmoSphere-TTS synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech.
We propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics.
arXiv Detail & Related papers (2024-06-12T01:40:29Z)
- Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition [12.605375307094416]
We propose an emotional text-to-speech design to simulate a wider spectrum of emotions grounded on the structural model.
Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn emotionally-separable prosody embedding as a proxy for emotion.
arXiv Detail & Related papers (2024-02-22T13:15:49Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech utterance according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Using Knowledge-Embedded Attention to Augment Pre-trained Language Models for Fine-Grained Emotion Recognition [0.0]
We focus on improving fine-grained emotion recognition by introducing external knowledge into a pre-trained self-attention model.
Results and error analyses show that our model outperforms previous models on several datasets.
arXiv Detail & Related papers (2021-07-31T09:41:44Z)
- A Circular-Structured Representation for Visual Emotion Distribution Learning [82.89776298753661]
We propose a well-grounded circular-structured representation that utilizes prior knowledge for visual emotion distribution learning.
To be specific, we first construct an Emotion Circle to unify any emotional state within it.
On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes.
arXiv Detail & Related papers (2021-06-23T14:53:27Z)
- Infusing Multi-Source Knowledge with Heterogeneous Graph Neural Network for Emotional Conversation Generation [25.808037796936766]
In a real-world conversation, we instinctively perceive emotions from multi-source information.
We propose a heterogeneous graph-based model for emotional conversation generation.
Experimental results show that our model can effectively perceive emotions from multi-source knowledge.
arXiv Detail & Related papers (2020-12-09T06:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.