Speech & Song Emotion Recognition Using Multilayer Perceptron and
Standard Vector Machine
- URL: http://arxiv.org/abs/2105.09406v1
- Date: Wed, 19 May 2021 21:28:05 GMT
- Title: Speech & Song Emotion Recognition Using Multilayer Perceptron and
Standard Vector Machine
- Authors: Behzad Javaheri
- Abstract summary: We have compared the performance of SVM and MLP in emotion recognition using speech and song channels of the RAVDESS dataset.
Optimised SVM outperforms MLP with an accuracy of 82% compared to 75%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Herein, we have compared the performance of SVM and MLP in emotion
recognition using speech and song channels of the RAVDESS dataset. We have
undertaken a journey to extract various audio features and to identify the optimal
scaling strategy and hyperparameters for our models. To increase sample size, we
have performed audio data augmentation and addressed data imbalance using
SMOTE. Our data indicate that optimised SVM outperforms MLP with an accuracy of
82% compared to 75%. Following data augmentation, the performance of both
algorithms was identical at ~79%; however, overfitting was evident for the SVM.
Our final exploration indicated that SVM and MLP performed similarly: both
yielded lower accuracy for the speech channel
than for the song channel. Our findings suggest that both SVM and MLP are
powerful classifiers for emotion recognition in a vocal-dependent manner.
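The abstract states that data imbalance was addressed using SMOTE but gives no implementation details. As an illustration only, the core idea of SMOTE (creating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours) can be sketched as follows. The function name `smote_oversample`, the toy feature vectors, and the choice of `k` are assumptions made for this sketch; this is a simplified version, not the paper's code or a library implementation.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D vectors standing in for extracted audio features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
majority_count = 10  # hypothetical size of the majority class
new_points = smote_oversample(minority, majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))  # 10
```

Because every synthetic point lies on a line segment between two real minority samples, the oversampled class stays inside the region already occupied by that class rather than duplicating existing points exactly.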
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
Video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models [90.99663022952498]
SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks.
SUPERB incurs high computational costs due to the large datasets and diverse tasks.
We introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models with results comparable to SUPERB but at significantly lower computational cost.
arXiv Detail & Related papers (2023-05-30T13:07:33Z)
- Emotional Expression Detection in Spoken Language Employing Machine Learning Algorithms [0.0]
There are a variety of features of the human voice that can be classified as pitch, timbre, loudness, and vocal tone.
It is observed in numerous incidents that humans express their feelings using different vocal qualities when they are speaking.
The primary objective of this research is to recognize different human emotions by using several functions, namely spectral descriptors, periodicity, and harmonicity.
arXiv Detail & Related papers (2023-04-20T17:57:08Z)
- A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit [2.969929079464237]
We show that SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset.
Also, we show that models trained using the Gaussian Noise and Speed Perturbation dataset are more robust when tested with augmented test sets.
arXiv Detail & Related papers (2023-02-27T20:46:36Z)
- SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation [11.91301106502376]
SpeechBlender is a fine-grained data augmentation pipeline for generating mispronunciation errors.
Our proposed technique achieves state-of-the-art results, with Speechocean762, on ASR dependent mispronunciation detection models.
arXiv Detail & Related papers (2022-11-02T07:13:30Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- A Hybrid MLP-SVM Model for Classification using Spatial-Spectral Features on Hyper-Spectral Images [1.648438955311779]
We make a hybrid classifier (MLP-SVM) using multilayer perceptron (MLP) and support vector machine (SVM).
Outputs from the last hidden layer of the neural network become the input to the SVM, which finally classifies into various desired classes.
The proposed method significantly increases the accuracy on testing dataset to 93.22%, 96.87%, 93.81% as compared to 86.97%, 88.58%, 88.85% and 91.61%, 96.20%, 90.68% based on individual classifiers SVM and MLP on Indian Pines, U. Pavia and
arXiv Detail & Related papers (2021-01-01T11:47:23Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model based on RNN-Transducer, together with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.