Related papers: Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study

URL: http://arxiv.org/abs/2511.00402v1
Date: Sat, 01 Nov 2025 05:01:04 GMT
Title: Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Authors: Lucky Onyekwelu-Udoka, Md Shafiqul Islam, Md Shahedul Hasan
Abstract summary: This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST. We benchmark their performance against a traditional CNN-LSTM baseline model using MFCC features. DistilHuBERT demonstrates superior accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB).
Score: 0.41292255339309664
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST, by classifying six core emotions from the CREMA-D dataset. We benchmark their performance against a traditional CNN-LSTM baseline model using MFCC features. DistilHuBERT demonstrates superior accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. Furthermore, we conducted an ablation study on three PaSST variants, with Linear, MLP, and Attentive Pooling heads, to understand the effect of classification head architecture on model performance. Our results indicate that PaSST with an MLP head yields the best performance among its variants but still falls short of DistilHuBERT. Among the emotion classes, angry is consistently the most accurately detected, while disgust remains the most challenging. These findings suggest that lightweight transformers like DistilHuBERT offer a compelling solution for real-time speech emotion recognition on edge devices. The code is available at: https://github.com/luckymaduabuchi/Emotion-detection-.
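The abstract reports accuracy, an F1 score, and per-class differences across six emotions. As a rough illustration of how such figures can be computed, here is a minimal sketch in plain Python. It is not taken from the authors' repository: the function names and the toy labels are hypothetical, and macro averaging of F1 is an assumption, since the abstract does not state which averaging was used.

```python
from collections import Counter

# The six emotion classes in CREMA-D.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred, labels=EMOTIONS):
    """F1 for each class, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=EMOTIONS):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy labels for illustration only; these are not results from the paper.
y_true = ["angry", "angry", "disgust", "disgust", "fear", "happy", "neutral", "sad"]
y_pred = ["angry", "angry", "sad", "disgust", "fear", "happy", "neutral", "sad"]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
print("per-class F1:", per_class_f1(y_true, y_pred))
```

In this toy data one disgust utterance is misclassified as sad, so the disgust and sad F1 scores fall below the others. A per-class breakdown of this kind is what makes observations such as "disgust remains the most challenging" visible.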

Related papers

Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement [5.149011601951617]
Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. Existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance. We constructed a balanced multi-label sentiment dataset using GoEmotions data, emotion-labeled samples from Sentiment140, and manually annotated texts. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data.
arXiv Detail & Related papers (2025-11-18T03:06:27Z)
A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts [2.820011731460364]
This study presents a comparative evaluation of four large language models (LLMs) for sentiment analysis and emotion detection in Persian social media texts. The results show that all models reach an acceptable level of performance, and a statistical comparison of the best three models indicates no significant differences among them. The findings indicate that the emotion detection task is more challenging for all models compared to the sentiment analysis task, and the misclassification patterns can represent some challenges in Persian language texts.
arXiv Detail & Related papers (2025-09-18T12:59:07Z)
Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets [0.0]
This study explores human recognition using a three-class subset of the COCO image corpus. The binary Vision Transformer (ViT) achieved 90% mean test accuracy.
arXiv Detail & Related papers (2025-06-13T11:16:50Z)
Emotion Detection in Reddit: Comparative Study of Machine Learning and Deep Learning Techniques [0.0]
This study concentrates on text-based emotion detection by leveraging the GoEmotions dataset. We employed a range of models for this task, including six machine learning models, three ensemble models, and a Long Short-Term Memory (LSTM) model. Results indicate that the Stacking classifier outperforms other models in accuracy and performance.
arXiv Detail & Related papers (2024-11-15T16:28:25Z)
MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection [54.545054873239295]
Deepfakes have recently raised significant trust issues and security concerns among the public. ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance. This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
arXiv Detail & Related papers (2024-04-12T13:02:08Z)
Improving the Generalizability of Text-Based Emotion Detection by Leveraging Transformers with Psycholinguistic Features [27.799032561722893]
We propose approaches for text-based emotion detection that leverage transformer models (BERT and RoBERTa) in combination with Bidirectional Long Short-Term Memory (BiLSTM) networks trained on a comprehensive set of psycholinguistic features. We find that the proposed hybrid models improve the ability to generalize to out-of-distribution data compared to a standard transformer-based approach.
arXiv Detail & Related papers (2022-12-19T13:58:48Z)
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks [82.21746840893658]
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network. We show that while the ResNet-18 model trained on DWT spectrograms achieves a high recognition accuracy, attacking this model is relatively more costly for the adversary.
arXiv Detail & Related papers (2022-04-14T15:14:08Z)
Towards Efficient NLP: A Standard Evaluation and A Strong Baseline [55.29756535335831]
This work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models. Along with the benchmark, we also pre-train and release a strong baseline, ElasticBERT, whose elasticity is both static and dynamic.
arXiv Detail & Related papers (2021-10-13T21:17:15Z)
Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network. Averaged over various experiments on three environmental sound datasets, we found the ResNet-18 model outperforms other deep learning architectures.
arXiv Detail & Related papers (2020-07-27T17:30:49Z)
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline. We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures. Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality by only 3.8% WER abs. worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.