Is Speech Emotion Recognition Language-Independent? Analysis of English
and Bangla Languages using Language-Independent Vocal Features
- URL: http://arxiv.org/abs/2111.10776v1
- Date: Sun, 21 Nov 2021 09:28:49 GMT
- Title: Is Speech Emotion Recognition Language-Independent? Analysis of English
and Bangla Languages using Language-Independent Vocal Features
- Authors: Fardin Saad, Hasan Mahmud, Md. Alamin Shaheen, Md. Kamrul Hasan,
Paresha Farastu
- Abstract summary: We used Bangla and English languages to assess whether distinguishing emotions from speech is independent of language.
The following emotions were categorized for this study: happiness, anger, neutral, sadness, disgust, and fear.
Although this study reveals that Speech Emotion Recognition (SER) is mostly language-independent, there is some disparity while recognizing emotional states like disgust and fear in these two languages.
- Score: 4.446085353384894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A language-agnostic approach to recognizing emotions from speech
remains an incomplete and challenging task. In this paper, we used the Bangla
and English languages to assess whether distinguishing emotions from speech is
independent of language. The following emotions were categorized for this
study: happiness,
anger, neutral, sadness, disgust, and fear. We employed three Emotional Speech
Sets, of which the first two were developed by native Bengali speakers in
Bangla and English languages separately. The third was the Toronto Emotional
Speech Set (TESS), which was developed by native English speakers from Canada.
We carefully selected language-independent prosodic features, adopted a Support
Vector Machine (SVM) model, and conducted three experiments to test our
proposition. In the first experiment, we measured the performance of the three
speech sets individually. This was followed by the second experiment, where we
recorded the classification rate after combining the speech sets. Finally, in
the third experiment, we measured the recognition rate by training the model on
one speech set and testing it on another. Although this study reveals that
Speech Emotion Recognition (SER) is mostly language-independent, there is some
disparity in recognizing emotional states such as disgust and fear across the
two languages. Moreover, our investigation suggests that non-native speakers
convey emotions through speech much as they do in their native tongue.
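
To make the described pipeline concrete, below is a minimal, hypothetical sketch of a prosodic-feature SVM classifier with a cross-corpus train/test split, mirroring the structure of the paper's third experiment. It assumes librosa and scikit-learn; the specific prosodic statistics, SVM hyperparameters, and corpus directory layout (speech_sets/<corpus>/<emotion>/*.wav) are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch: language-independent prosodic features fed to an SVM,
# evaluated cross-corpus (train on one speech set, test on another).
# Feature choices and file layout are illustrative assumptions only.
import glob
import os

import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

EMOTIONS = ["happiness", "anger", "neutral", "sadness", "disgust", "fear"]

def prosodic_features(path: str) -> np.ndarray:
    """Summarize pitch, energy, and voicing statistics for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)   # fundamental-frequency track
    f0 = f0[np.isfinite(f0)]
    rms = librosa.feature.rms(y=y)[0]               # frame-level energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]  # rough voicing/noisiness cue
    feats = []
    for track in (f0, rms, zcr):
        feats += [track.mean(), track.std(), track.min(), track.max()]
    feats.append(librosa.get_duration(y=y, sr=sr))  # utterance duration
    return np.array(feats)

def load_corpus(root: str):
    """Assumes <root>/<emotion>/*.wav; returns a feature matrix and labels."""
    X, y = [], []
    for emotion in EMOTIONS:
        for wav in glob.glob(os.path.join(root, emotion, "*.wav")):
            X.append(prosodic_features(wav))
            y.append(emotion)
    return np.array(X), np.array(y)

# Cross-corpus setup: train on one speech set, test on another
# (e.g., Bangla recordings vs. TESS). Directory names are placeholders.
X_train, y_train = load_corpus("speech_sets/bangla")
X_test, y_test = load_corpus("speech_sets/tess")

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

Swapping the train and test corpora, or concatenating the corpora before a stratified split, reproduces the structure of the paper's first two experiments.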
Related papers
- Language-Agnostic Analysis of Speech Depression Detection [2.5764071253486636]
This work analyzes automatic speech-based depression detection across two languages, English and Malayalam.
A CNN model is trained to identify acoustic features associated with depression in speech, focusing on both languages.
Our findings and collected data could contribute to the development of language-agnostic speech-based depression detection systems.
arXiv Detail & Related papers (2024-09-23T07:35:56Z)
- BanSpEmo: A Bangla Emotional Speech Recognition Dataset [0.0]
This corpus contains 792 audio recordings over a duration of more than 1 hour and 23 minutes.
The dataset consists of 12 Bangla sentences, each uttered in six emotions: Disgust, Happy, Sad, Surprised, Anger, and Fear.
BanSpEmo can be considered as a useful resource to promote emotion and speech recognition research and related applications in the Bangla language.
arXiv Detail & Related papers (2023-12-21T16:52:41Z)
- AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
- Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data [0.0]
We propose a method for speech-to-speech emotion translation that operates at the level of discrete speech units.
We show that this embedding can be used to predict the pitch and duration of speech units in a target language.
We evaluate our approach on English and French speech signals and show that it outperforms a baseline method.
arXiv Detail & Related papers (2023-06-29T08:06:54Z)
- Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition [48.29355616574199]
We analyze the transferability of emotion recognition across three languages: English, Mandarin Chinese, and Cantonese.
This study concludes that different language and age groups require specific speech features, thus making cross-lingual inference an unsuitable method.
arXiv Detail & Related papers (2023-06-26T08:48:08Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- A study on native American English speech recognition by Indian listeners with varying word familiarity level [62.14295630922855]
We collect three kinds of responses from each listener as they recognize an utterance.
From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.
Speaker-nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those from other nativities.
arXiv Detail & Related papers (2021-12-08T07:43:38Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)