Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
- URL: http://arxiv.org/abs/2512.18298v1
- Date: Sat, 20 Dec 2025 10:05:58 GMT
- Title: Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
- Authors: Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee
- Abstract summary: Speech Emotion Recognition systems often degrade in performance when exposed to unpredictable acoustic interference. We propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using real-world noise profiles from the SAS-KIIT dataset. The proposed framework demonstrates superior generalization and state-of-the-art accuracy across all datasets, significantly outperforming single-branch baselines under realistic environmental interference. Furthermore, we address the "black-box" problem by integrating SHAP and Score-CAM into the evaluation pipeline. These tools provide granular visual explanations, revealing how the model strategically shifts attention between temporal and spectral cues to maintain reliability in the presence of complex environmental noise.
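The abstract does not specify the internals of the Attentive Temporal Pooling mechanism, but a common form of attention-based pooling collapses a sequence of frame-level features (such as a stacked MFCC/ZCR/RMSE representation) into a single utterance-level vector via a softmax over per-frame relevance scores. A minimal sketch, assuming a single learned scoring vector `w` (the paper's actual parameterization may differ):

```python
import numpy as np

def attentive_temporal_pooling(frames, w):
    """Collapse a (T, D) sequence of frame-level features into one (D,)
    utterance-level vector via attention weights over time.

    frames : (T, D) array of per-frame features (e.g. stacked MFCC/ZCR/RMSE)
    w      : (D,) scoring vector; assumed learned, supplied here by the caller
    """
    scores = frames @ w                            # (T,) relevance per frame
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention over time
    return alpha @ frames                          # attention-weighted average

# Usage: 50 frames of 4-dim features pooled into one 4-dim vector.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 4))
utterance_vec = attentive_temporal_pooling(frames, rng.standard_normal(4))
print(utterance_vec.shape)  # (4,)
```

With a zero scoring vector the attention weights are uniform and the pooling reduces to plain mean pooling, which makes the mechanism easy to sanity-check.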
Related papers
- AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition [72.24180896265192]
We introduce AgentNoiseBench, a framework for evaluating the robustness of agentic models in noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios. We then categorize environmental noise into two primary types: user noise and tool noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
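Controllable noise injection of the kind described in both the main paper (SAS-KIIT noise profiles) and this benchmark is, for audio, typically implemented as additive mixing at a target signal-to-noise ratio. A minimal sketch (the function name and SNR-based parameterization are illustrative assumptions, not from either paper):

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Additively mix `noise` into `signal` at a target SNR in dB.

    signal, noise : 1-D float arrays; noise is tiled/cropped to signal length.
    """
    noise = np.resize(noise, signal.shape)   # repeat or crop to match length
    p_sig = np.mean(signal ** 2)             # signal power
    p_noise = np.mean(noise ** 2)            # noise power before scaling
    # Choose scale so that p_sig / (scale^2 * p_noise) == 10^(snr_db / 10).
    scale = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

# Usage: corrupt a clean tone with white noise at 10 dB SNR.
clean = np.sin(np.linspace(0, 100, 16000))
rng = np.random.default_rng(0)
noisy = mix_at_snr(clean, rng.standard_normal(8000), snr_db=10.0)
```

Sweeping `snr_db` downward (e.g. 20 dB to 0 dB) yields progressively harder test conditions, which is the usual way robustness curves like those in the main paper are produced.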
arXiv Detail & Related papers (2026-02-11T20:33:10Z) - Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems [0.0]
Keyword Spotting systems with small-footprint models deployed on edge devices face significant accuracy and robustness challenges. We propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features.
arXiv Detail & Related papers (2026-01-22T17:59:31Z) - RMSup: Physics-Informed Radio Map Super-Resolution for Compute-Enhanced Integrated Sensing and Communications [28.003646295374022]
We present RMSup, a physics-informed framework that functions with uniform sparse sampling and imperfect environment priors. Experimental results show the proposed RMSup achieves state-of-the-art performance in both RM construction and ISAC-related environment sensing.
arXiv Detail & Related papers (2025-11-29T09:00:12Z) - FAIM: Frequency-Aware Interactive Mamba for Time Series Classification [87.84511960413715]
Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition. We propose FAIM, a lightweight Frequency-Aware Interactive Mamba model. We show that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2025-11-26T08:36:33Z) - AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset [11.179608136803447]
Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. Existing approaches rely solely on visual data, making them unreliable under challenging conditions. The absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition.
arXiv Detail & Related papers (2025-10-15T14:56:00Z) - Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection [53.689841037081834]
Ivan-ISTD is designed to address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD. Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-14T07:48:31Z) - Ecologically Valid Benchmarking and Adaptive Attention: Scalable Marine Bioacoustic Monitoring [2.558238597112103]
GetNetUPAM is a nested cross-validation framework for assessing model stability under realistic variability. Data are partitioned into distinct site-year segments, preserving recording conditions and ensuring each validation fold reflects a unique environmental subset. ARPA-N achieves a 14.4% gain in average precision over DenseNet baselines and an order-of-magnitude (log2-scale) drop in variability across all metrics.
arXiv Detail & Related papers (2025-09-04T22:03:05Z) - DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection [49.9059941674531]
We propose DenoDet V2, which exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism. DenoDet V2 achieves a significant 0.8% improvement on the SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half.
arXiv Detail & Related papers (2025-08-12T23:24:20Z) - Adaptive Control Attention Network for Underwater Acoustic Localization and Domain Adaptation [8.017203108408973]
Localizing acoustic sound sources in the ocean is a challenging task due to the complex and dynamic nature of the environment. We propose a multi-branch network architecture designed to accurately predict the distance between a moving acoustic source and a receiver. Our proposed method outperforms state-of-the-art (SOTA) approaches in similar settings.
arXiv Detail & Related papers (2025-06-20T18:13:30Z) - Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data.
We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising.
Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z) - From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks [82.21746840893658]
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
We show that while the ResNet-18 model trained on DWT spectrograms achieves a high recognition accuracy, attacking this model is relatively more costly for the adversary.
arXiv Detail & Related papers (2022-04-14T15:14:08Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.