Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
- URL: http://arxiv.org/abs/2204.00400v1
- Date: Fri, 1 Apr 2022 12:47:45 GMT
- Title: Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
- Authors: Andreas Triantafyllopoulos, Johannes Wagner, Hagen Wierstorf,
Maximilian Schmitt, Uwe Reichel, Florian Eyben, Felix Burkhardt, Bj\"orn W.
Schuller
- Abstract summary: We investigate the extent in which linguistic information is exploited during speech emotion recognition fine-tuning.
We synthesise prosodically neutral speech utterances while varying the sentiment of the text.
Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers.
- Score: 7.81884995637243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large, pre-trained neural networks consisting of self-attention layers
(transformers) have recently achieved state-of-the-art results on several
speech emotion recognition (SER) datasets. These models are typically
pre-trained in self-supervised manner with the goal to improve automatic speech
recognition performance -- and thus, to understand linguistic information. In
this work, we investigate the extent in which this information is exploited
during SER fine-tuning. Using a reproducible methodology based on open-source
tools, we synthesise prosodically neutral speech utterances while varying the
sentiment of the text. Valence predictions of the transformer model are very
reactive to positive and negative sentiment content, as well as negations, but
not to intensifiers or reducers, while none of those linguistic features impact
arousal or dominance. These findings show that transformers can successfully
leverage linguistic information to improve their valence predictions, and that
linguistic analysis should be included in their testing.
Related papers
- Bias-Free Sentiment Analysis through Semantic Blinding and Graph Neural Networks [0.0]
The SProp GNN relies exclusively on syntactic structures and word-level emotional cues to predict emotions in text.
By semantically blinding the model to information about specific words, it is robust to biases such as political or gender bias.
The SProp GNN shows performance superior to lexicon-based alternatives on two different prediction tasks, and across two languages.
arXiv Detail & Related papers (2024-11-19T13:23:53Z) - A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias.
Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions.
This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z) - Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT)
Model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z) - Enhancing expressivity transfer in textless speech-to-speech translation [0.0]
Existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages.
This study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings.
We demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language.
arXiv Detail & Related papers (2023-10-11T08:07:22Z) - Acoustic and linguistic representations for speech continuous emotion
recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning towards AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor for the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - Analysis of the Evolution of Advanced Transformer-Based Language Models:
Experiments on Opinion Mining [0.5735035463793008]
This paper studies the behaviour of the cutting-edge Transformer-based language models on opinion mining.
Our comparative study shows leads and paves the way for production engineers regarding the approach to focus on.
arXiv Detail & Related papers (2023-08-07T01:10:50Z) - Color Overmodification Emerges from Data-Driven Learning and Pragmatic
Reasoning [53.088796874029974]
We show that speakers' referential expressions depart from communicative ideals in ways that help illuminate the nature of pragmatic language use.
By adopting neural networks as learning agents, we show that overmodification is more likely with environmental features that are infrequent or salient.
arXiv Detail & Related papers (2022-05-18T18:42:43Z) - Dawn of the transformer era in speech emotion recognition: closing the
valence gap [9.514396745161793]
We investigate the influence of model size and pre-training data on downstream performance.
We fine-tune several pre-trained variants of wav2vec 2.0 and HuBERT and test cross-corpus generalisation.
Our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline.
arXiv Detail & Related papers (2022-03-14T13:21:47Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process.
We propose a framework that we call controllable grounded response generation (CGRG)
We show that using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
arXiv Detail & Related papers (2020-05-01T21:22:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.