Explaining Speech Classification Models via Word-Level Audio Segments
and Paralinguistic Features
- URL: http://arxiv.org/abs/2309.07733v1
- Date: Thu, 14 Sep 2023 14:12:34 GMT
- Title: Explaining Speech Classification Models via Word-Level Audio Segments
and Paralinguistic Features
- Authors: Eliana Pastor, Alkis Koudounas, Giuseppe Attanasio, Dirk Hovy, Elena
Baralis
- Abstract summary: We introduce a new approach to explain speech classification models.
We generate easy-to-interpret explanations via input perturbation on two information levels.
We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian.
- Score: 35.31998003091635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in eXplainable AI (XAI) have provided new insights into how
models for vision, language, and tabular data operate. However, few approaches
exist for understanding speech models. Existing work focuses on a few spoken
language understanding (SLU) tasks, and explanations are difficult to interpret
for most users. We introduce a new approach to explain speech classification
models. We generate easy-to-interpret explanations via input perturbation on
two information levels. 1) Word-level explanations reveal how each word-related
audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody
and background noise) answer the counterfactual: "What would the model
prediction be if we edited the audio signal in this way?" We validate our
approach by explaining two state-of-the-art SLU models on two speech
classification tasks in English and Italian. Our findings demonstrate that the
explanations are faithful to the model's inner workings and plausible to
humans. Our method and findings pave the way for future research on
interpreting speech models.
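The perturbation scheme is straightforward to prototype. Below is a minimal sketch of both explanation levels, assuming a generic `classify(audio, sr)` callable that returns class probabilities, word timestamps from a forced aligner, and NumPy audio arrays; all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def word_level_importance(classify, audio, sr, word_spans, target_class):
    """Level 1: occlude each word-aligned segment and measure the drop in the
    target-class probability (a larger drop means a more important word).
    `word_spans` is a list of (word, (start_sec, end_sec)) pairs, e.g., from
    a forced aligner."""
    base = classify(audio, sr)[target_class]
    scores = []
    for word, (start, end) in word_spans:
        masked = audio.copy()
        masked[int(start * sr):int(end * sr)] = 0.0  # silence the word's segment
        scores.append((word, base - classify(masked, sr)[target_class]))
    return scores

def paralinguistic_effect(classify, audio, sr, edit, target_class):
    """Level 2: answer "what would the prediction be if we edited the signal
    this way?" by applying a paralinguistic edit and reporting the change in
    the target-class probability."""
    return classify(edit(audio, sr), sr)[target_class] - classify(audio, sr)[target_class]

def add_noise(snr_db):
    """Example edit: additive white background noise at a fixed SNR in dB."""
    def edit(audio, sr):
        rms = np.sqrt(np.mean(audio ** 2))
        noise_rms = rms / (10.0 ** (snr_db / 20.0))
        return audio + np.random.randn(len(audio)) * noise_rms
    return edit
```

Pitch shifting or time stretching (e.g., via librosa or torchaudio effects) slot in as alternative `edit` functions for probing prosody; a forced aligner such as the Montreal Forced Aligner can supply `word_spans`.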
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946] (2024-05-28)
We introduce TransVIP, a novel framework that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during translation.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
- What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis [44.93152068353389] (2024-01-31)
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations.
Speaker SSL models, by contrast, adopt utterance-level training objectives aimed primarily at speaker representation.
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489] (2023-10-12)
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
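As a rough illustration of what mixing speech units and text can look like (a toy sketch under assumed word-level alignment, not the paper's actual mixing techniques; the `<switch>` boundary token is hypothetical):

```python
import random

def mix_word_level(unit_words, text_words, p_speech=0.5, boundary="<switch>"):
    """Toy word-level mixing: for each aligned word, emit either its speech-unit
    tokens or its text tokens, inserting a boundary token at modality switches.
    `unit_words` and `text_words` are parallel lists of per-word token lists."""
    mixed, prev = [], None
    for units, text in zip(unit_words, text_words):
        modality = "speech" if random.random() < p_speech else "text"
        if prev is not None and modality != prev:
            mixed.append(boundary)
        mixed.extend(units if modality == "speech" else text)
        prev = modality
    return mixed
```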
- Do self-supervised speech and language models extract similar representations as human brain? [2.390915090736061] (2023-10-07)
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
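Layer-wise brain-prediction analyses of this kind are typically run as linear encoding models from layer activations to recorded responses. The sketch below (scikit-learn ridge regression, illustrative variable names) shows the general recipe, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def layerwise_brain_scores(layer_activations, brain_responses):
    """For each model layer, fit activations (n_samples x n_features) to brain
    responses (n_samples x n_channels) and score held-out Pearson correlation;
    the best-scoring layer suggests where brain-like structure concentrates."""
    scores = []
    for X in layer_activations:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, brain_responses, test_size=0.2, random_state=0)
        pred = RidgeCV(alphas=(0.1, 1.0, 10.0, 100.0)).fit(X_tr, y_tr).predict(X_te)
        r = [np.corrcoef(pred[:, c], y_te[:, c])[0, 1] for c in range(y_te.shape[1])]
        scores.append(float(np.mean(r)))
    return scores
```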
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934] (2022-09-26)
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
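Such representation evaluations are commonly run as linear probes on frozen, mean-pooled features. A minimal sketch, assuming features have already been extracted with a frozen SSL model (that upstream step is not shown):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Train a linear classifier on frozen SSL features (n_clips x hidden_dim,
    e.g., hidden states mean-pooled over time) and report test accuracy as a
    proxy for how much task-relevant information the representation encodes."""
    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return accuracy_score(test_labels, clf.predict(test_feats))
```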
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162] (2022-04-11)
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset.
- A Brief Overview of Unsupervised Neural Speech Representation Learning [12.850357461259197] (2022-03-01)
We review the development of unsupervised representation learning for speech over the last decade.
We identify two primary model categories: self-supervised methods and probabilistic latent variable models.
- Interpreting Language Models with Contrastive Explanations [99.7035899290924] (2022-02-21)
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, making them harder for humans to interpret.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
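A minimal sketch of one gradient-based way to compute a contrastive explanation, assuming a Hugging Face-style causal LM (the interface is an assumption; the paper's exact method may differ): attribute the model's preference for a target token over a foil token by differentiating their logit difference with respect to the input embeddings.

```python
import torch

def contrastive_saliency(model, input_ids, target_id, foil_id):
    """Why `target_id` rather than `foil_id`? Backpropagate the contrastive
    logit gap to the input embeddings and use per-token gradient norms as
    saliency scores."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]  # next-token logits
    (logits[target_id] - logits[foil_id]).backward()
    return embeds.grad.norm(dim=-1).squeeze(0)  # one saliency score per input token
```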
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685] (2020-10-05)
Spoken language understanding requires a model to analyze the input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.