Related papers: Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration

Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration

URL: http://arxiv.org/abs/2601.17085v1
Date: Fri, 23 Jan 2026 07:57:05 GMT
Title: Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration
Authors: Esther Sun, Abinay Reddy Naini, Carlos Busso,
Abstract summary: This paper presents a comprehensive investigation of discrete tokens for speech emotion recognition (SER)<n>We quantify performance degradation across different layer configurations and k-means quantization granularities.<n>We propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues.
Score: 28.470758433815423
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the information loss, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that through multi-layer fusion and acoustic feature integration, discrete tokens can close the performance gap with continuous representations in SER tasks.

Related papers

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs [59.230858581944425]
Two dominant approaches have emerged for speech processing: discrete tokens and continuous features.<n>We compare self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings.<n>Our findings reveal that continuous features generally outperform discrete tokens in various tasks.
arXiv Detail & Related papers (2025-08-25T10:16:07Z)
Feature Hallucination for Self-supervised Action Recognition [37.20267786858476]
We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames.<n>Our framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.
arXiv Detail & Related papers (2025-06-25T11:50:23Z)
Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture [2.3272964989267626]
We propose a lightweight, yet effective fusion-based deep learning model tailored for utterance-level emotion classification.<n>Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can outperform or match more complex models.
arXiv Detail & Related papers (2025-05-05T02:31:11Z)
"Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.<n>Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations.<n> SSL features conflict with traditional spectral features like FBanks in terms of update directions.<n>We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models [46.298114175792584]
We present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks. Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding.
arXiv Detail & Related papers (2024-11-13T16:20:20Z)
Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features. Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information [21.527784717450885]
Speech Emotion Recognition aims to help the machine to understand human's subjective emotion from only audio information. We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
arXiv Detail & Related papers (2022-03-29T08:17:28Z)
End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned. We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding. By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.