Related papers: Probing Speech Emotion Recognition Transformers for Linguistic Knowledge

URL: http://arxiv.org/abs/2204.00400v1
Date: Fri, 1 Apr 2022 12:47:45 GMT
Title: Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
Authors: Andreas Triantafyllopoulos, Johannes Wagner, Hagen Wierstorf, Maximilian Schmitt, Uwe Reichel, Florian Eyben, Felix Burkhardt, Björn W. Schuller
Abstract summary: We investigate the extent to which linguistic information is exploited during speech emotion recognition fine-tuning. We synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers.
Score: 7.81884995637243
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance, and thus of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers, while none of those linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
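
The text-side part of the probing setup described in the abstract (varying sentiment words, negations, intensifiers and reducers while everything else stays fixed) can be illustrated with a small sketch. The word lists, the sentence template and the function name `build_probes` below are hypothetical and chosen only for illustration; the abstract does not state the paper's actual vocabulary or templates. Synthesising the sentences with neutral prosody and scoring the audio with an SER model are not shown.

```python
# Minimal sketch: generate probe sentences that differ only in their
# linguistic sentiment cues. All word lists and the template are assumptions,
# not taken from the paper.

POSITIVE = ["good", "happy"]          # hypothetical positive sentiment words
NEGATIVE = ["bad", "sad"]             # hypothetical negative sentiment words
INTENSIFIERS = ["very", "extremely"]  # hypothetical intensifiers
REDUCERS = ["slightly", "somewhat"]   # hypothetical reducers


def build_probes():
    """Return one record per probe sentence with its condition label."""
    lexicon = [("positive", w) for w in POSITIVE]
    lexicon += [("negative", w) for w in NEGATIVE]

    probes = []
    for polarity, word in lexicon:
        # Each sentiment word appears plain, negated, intensified and reduced.
        variants = [
            ("plain", f"I am {word}."),
            ("negation", f"I am not {word}."),
        ]
        variants += [("intensifier", f"I am {m} {word}.") for m in INTENSIFIERS]
        variants += [("reducer", f"I am {m} {word}.") for m in REDUCERS]

        for condition, text in variants:
            probes.append(
                {"text": text, "word_polarity": polarity, "condition": condition}
            )
    return probes


if __name__ == "__main__":
    for probe in build_probes()[:6]:
        print(probe)
```

Grouping a model's valence predictions by `condition` and `word_polarity` would then show which linguistic manipulations shift the output, which is the kind of comparison the abstract reports.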

Related papers

Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis [2.0499240875882]
We introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance. These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
arXiv Detail & Related papers (2025-02-26T18:32:15Z)
Bias-Free Sentiment Analysis through Semantic Blinding and Graph Neural Networks [0.0]
The SProp GNN relies exclusively on syntactic structures and word-level emotional cues to predict emotions in text. By semantically blinding the model to information about specific words, it is robust to biases such as political or gender bias. The SProp GNN shows performance superior to lexicon-based alternatives on two different prediction tasks, and across two languages.
arXiv Detail & Related papers (2024-11-19T13:23:53Z)
A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z)
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework. We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
Enhancing expressivity transfer in textless speech-to-speech translation [0.0]
Existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages. This study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings. We demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language.
arXiv Detail & Related papers (2023-10-11T08:07:22Z)
Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus. Our experiments confirm the large gain in performance obtained with the use of pre-trained features. Surprisingly, we found that the linguistic content is clearly the major contributor to the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
Analysis of the Evolution of Advanced Transformer-Based Language Models: Experiments on Opinion Mining [0.5735035463793008]
This paper studies the behaviour of cutting-edge Transformer-based language models on opinion mining. Our comparative study offers leads and paves the way for production engineers regarding which approach to focus on.
arXiv Detail & Related papers (2023-08-07T01:10:50Z)
Color Overmodification Emerges from Data-Driven Learning and Pragmatic Reasoning [53.088796874029974]
We show that speakers' referential expressions depart from communicative ideals in ways that help illuminate the nature of pragmatic language use. By adopting neural networks as learning agents, we show that overmodification is more likely with environmental features that are infrequent or salient.
arXiv Detail & Related papers (2022-05-18T18:42:43Z)
Dawn of the transformer era in speech emotion recognition: closing the valence gap [9.514396745161793]
We investigate the influence of model size and pre-training data on downstream performance. We fine-tune several pre-trained variants of wav2vec 2.0 and HuBERT and test cross-corpus generalisation. Our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline.
arXiv Detail & Related papers (2022-03-14T13:21:47Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process. We propose a framework that we call controllable grounded response generation (CGRG). We show that using this framework, a transformer-based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
arXiv Detail & Related papers (2020-05-01T21:22:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.