MELD-ST: An Emotion-aware Speech Translation Dataset
- URL: http://arxiv.org/abs/2405.13233v1
- Date: Tue, 21 May 2024 22:40:38 GMT
- Title: MELD-ST: An Emotion-aware Speech Translation Dataset
- Authors: Sirou Chen, Sakiko Yahata, Shuichiro Shimizu, Zhengdong Yang, Yihang Li, Chenhui Chu, Sadao Kurohashi
- Abstract summary: We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs.
Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset.
Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings.
- Score: 29.650945917540316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.
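The abstract does not spell out how the emotion labels enter fine-tuning; one common way to condition a translation model on such labels is to prepend a tag token to the source utterance. A minimal sketch under that assumption (the `<label>` tag format is illustrative, not the paper's actual recipe; the label set follows MELD's seven emotion categories):

```python
# Illustrative sketch: conditioning translation training pairs on MELD-style
# emotion labels by prepending a tag token to the source utterance.
# The "<label>" tag format is an assumption, not the paper's actual recipe.
EMOTIONS = {"anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"}

def tag_source(utterance: str, emotion: str) -> str:
    """Prepend an emotion tag so a translation model can condition on it."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion}")
    return f"<{emotion}> {utterance}"

# A hypothetical English-to-German training pair:
pair = (tag_source("Oh my God, you're back!", "surprise"),
        "Oh mein Gott, du bist zurück!")
```

Tagging of this kind leaves the model architecture untouched, which is one reason it is a popular first baseline for label-conditioned translation.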
Related papers
- BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER, a collection of emotion-annotated datasets in 28 different languages.
We describe the data collection and annotation processes and the challenges of building these datasets.
We show that BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
- Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues [4.894647740789939]
The Akan Conversation Emotion dataset is the first multimodal emotion dialogue dataset for an African language.
It contains 385 emotion-labeled dialogues and 6,162 utterances across audio, visual, and textual modalities.
The presence of prosodic labels in this dataset also makes it the first prosodically annotated African language dataset.
arXiv Detail & Related papers (2025-02-16T03:24:33Z)
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
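The summary names the fused features but not the fusion itself, which in HuMP-CAT is done with a cross-attention transformer. As a simpler stand-in, the sketch below only illustrates assembling the three per-utterance feature vectors (toy dimensions, hypothetical names) into one fused vector by concatenation:

```python
# Illustrative late-fusion stand-in (toy dimensions): concatenate a pooled
# self-supervised embedding with MFCC and prosodic statistics into one
# per-utterance vector. The actual HuMP-CAT fusion uses cross-attention.
from typing import Sequence

def fuse_features(hubert_emb: Sequence[float],
                  mfcc_stats: Sequence[float],
                  prosody: Sequence[float]) -> list:
    """Concatenate per-utterance feature vectors into a single fused vector."""
    return list(hubert_emb) + list(mfcc_stats) + list(prosody)

fused = fuse_features([0.1] * 4, [0.2] * 3, [0.3] * 2)  # toy dimensions
```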
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to understand human emotions by combining text, audio, and visual data.
Yet, discerning subtle emotional nuances in audio and video expressions remains a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Generative Emotion Cause Explanation in Multimodal Conversations [23.39751445330256]
We propose a new task, Multimodal Conversation Emotion Cause Explanation (MCECE).
It aims to generate a detailed explanation of the emotional cause of a target utterance within a multimodal conversation scenario.
A novel approach, FAME-Net, is proposed that harnesses the power of Large Language Models (LLMs) to analyze visual data and accurately interpret the emotions conveyed through facial expressions in videos.
arXiv Detail & Related papers (2024-11-01T09:16:30Z)
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in mimicking a broad spectrum of human emotions.
This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance.
It can synthesize a diversity of emotional styles without requiring any emotional speech data during TTS training.
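A minimal sketch of the pleasure-arousal-dominance (PAD) idea behind such control: each emotion maps to a point in a three-dimensional space, and blending points spans intermediate styles. The specific vectors below are toy assumptions, not the paper's learned representation:

```python
# Illustrative sketch: emotions as pleasure-arousal-dominance (PAD) vectors,
# each dimension in [-1, 1]. These values are toy assumptions, not the
# paper's learned representation.
PAD = {
    "joy":     (0.8, 0.5, 0.4),
    "anger":   (-0.6, 0.7, 0.3),
    "sadness": (-0.7, -0.4, -0.3),
}

def interpolate(a: str, b: str, t: float) -> tuple:
    """Linearly blend two PAD vectors to span intermediate emotional styles."""
    pa, pb = PAD[a], PAD[b]
    return tuple((1 - t) * x + t * y for x, y in zip(pa, pb))

blended = interpolate("joy", "sadness", 0.5)  # halfway between joy and sadness
```

Because the control space is continuous, a TTS system conditioned on such vectors can in principle render emotional styles never seen as discrete labels during training.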
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
- Usefulness of Emotional Prosody in Neural Machine Translation [1.0205541448656992]
We propose to improve translation quality by adding another external source of information: the automatically recognized emotion in the voice.
This work is motivated by the assumption that each emotion is associated with a specific lexicon that can overlap between emotions.
We show that integrating emotion information, especially arousal, into NMT systems leads to better translations.
arXiv Detail & Related papers (2024-04-27T18:04:28Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion [87.18073195745914]
We investigate how well human-annotated emotion triggers correlate with features deemed salient in their prediction of emotions.
Using EmoTrigger, we evaluate the ability of large language models to identify emotion triggers.
Our analysis reveals that emotion triggers are largely not considered salient features by emotion prediction models; instead, there is an intricate interplay between various features and the task of emotion detection.
arXiv Detail & Related papers (2023-11-16T06:20:13Z)
- Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks [3.570593982494095]
We look at speech emotion understanding as a perception task, which is a more realistic setting.
We leverage the rich ComParE dataset of multilingual speakers and its multi-label regression target of 'emotion share', i.e., the perceived strength of each emotion.
Our results show that HuBERT-Large with a self-attention-based lightweight sequence model provides a 4.6% improvement over the reported baseline.
arXiv Detail & Related papers (2023-08-28T07:11:27Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.