Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2309.10294v1
- Date: Tue, 19 Sep 2023 03:52:01 GMT
- Title: Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition
- Authors: Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen
- Abstract summary: We investigated the representation ability of different speech self-supervised pre-trained models.
We employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explored how to boost speech emotion recognition (SER) with a state-of-the-art speech pre-trained model (PTM), data2vec; a text generation model, GPT-4; and a speech synthesis system, Azure TTS. First, we investigated the representation ability of different self-supervised speech pre-trained models and found that data2vec provides strong representations for the SER task. Second, we employed a powerful large language model (LLM), GPT-4, and an emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompts and the dataset construction procedure to obtain high-quality synthetic emotional speech data. Third, we studied different data augmentation strategies for exploiting the synthetic speech in SER, including random mixing, adversarial training, transfer learning, and curriculum learning. Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of our method compared with other data augmentation techniques and with augmentation using other sources of synthetic data.
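To make the generation pipeline concrete, below is a minimal sketch that pairs GPT-4 text generation with Azure neural TTS emotional styles via SSML. The abstract does not disclose the actual prompts, voice, or emotion-to-style mapping, so the prompt wording, the en-US-JennyNeural voice, and the style table here are illustrative assumptions, not the authors' recipe.

```python
# Hedged sketch: generate emotionally congruent text with GPT-4, then
# synthesize it with an emotional Azure neural voice via SSML.
# pip install openai azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# IEMOCAP-style target emotions; this mapping to Azure styles is an
# assumption -- "neutral" simply omits the express-as style.
AZURE_STYLE = {"angry": "angry", "happy": "cheerful", "sad": "sad", "neutral": None}

def generate_text(emotion: str, n: int = 5) -> list[str]:
    """Ask GPT-4 for short utterances that plausibly convey `emotion`.
    The prompt wording is illustrative, not the paper's."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write {n} short conversational sentences a person "
                       f"might say when feeling {emotion}. One per line.",
        }],
    )
    return [l.strip() for l in resp.choices[0].message.content.splitlines() if l.strip()]

def synthesize(text: str, emotion: str, out_wav: str) -> None:
    """Render `text` with an emotional speaking style using Azure neural TTS."""
    config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
    audio = speechsdk.audio.AudioOutputConfig(filename=out_wav)
    synth = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=audio)
    style = AZURE_STYLE[emotion]
    body = (f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
            if style else text)
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="en-US-JennyNeural">{body}</voice></speak>'
    )
    synth.speak_ssml_async(ssml).get()

for emotion in AZURE_STYLE:
    for i, text in enumerate(generate_text(emotion)):
        synthesize(text, emotion, f"synthetic_{emotion}_{i}.wav")
```

The resulting (waveform, emotion) pairs would then feed the augmentation strategies named above, e.g., interleaving synthetic and IEMOCAP utterances at random (random mixing) or introducing synthetic data on a schedule (curriculum learning); the abstract does not detail the exact recipes.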
Related papers
- BLSP-Emo: Towards Empathetic Large Speech-Language Models
We present BLSP-Emo, a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech.
Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses.
arXiv Detail & Related papers (2024-06-06T09:02:31Z)
- MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis
The Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across the text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotion embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions (a generic sketch of the alignment idea follows this entry).
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
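The cross-modal alignment in EP-Align can be pictured with a generic symmetric InfoNCE objective. The sketch below is a minimal illustration under that assumption (shown for a text-audio pair; a visual branch would be handled analogously), not the authors' actual module, whose loss and architecture the summary does not specify.

```python
# Hedged sketch of cross-modal emotion alignment with a symmetric
# InfoNCE loss, in the spirit of EP-Align.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, audio_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Pull matched (text, audio) emotion embeddings together and push
    mismatched pairs apart. Both inputs are (batch, dim)."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = positives
    # Symmetric loss: text-to-audio plus audio-to-text retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```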
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the system with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict the emotion label from the input text alone and generate more expressive speech conditioned on the corresponding emotion embedding (a minimal sketch of this text-driven conditioning follows this entry).
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
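To make the text-only emotion conditioning concrete, the following is a minimal sketch of the idea: a text encoder predicts an emotion label, and the corresponding learned embedding conditions the states a TTS decoder would consume. The GRU backbone, layer sizes, and class count are placeholder assumptions; the summary above does not describe the actual architecture.

```python
# Hedged sketch: infer an emotion label from the input text alone, then
# condition the acoustic model on a learned emotion embedding.
import torch
import torch.nn as nn

class TextEmotionConditioner(nn.Module):
    def __init__(self, vocab: int = 10_000, dim: int = 256, n_emotions: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_emotions)  # emotion from text only
        self.emotion_emb = nn.Embedding(n_emotions, dim)

    def forward(self, token_ids: torch.Tensor):
        states, _ = self.encoder(self.embed(token_ids))  # (B, T, dim)
        logits = self.classifier(states.mean(dim=1))     # pooled -> emotion logits
        emotion = logits.argmax(dim=-1)  # predicted label (ground truth at training time)
        # Broadcast the emotion embedding over time and add it to the
        # encoder states that a downstream TTS decoder would attend to.
        conditioned = states + self.emotion_emb(emotion).unsqueeze(1)
        return conditioned, logits
```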
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)