On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion
Recognition: An Update for the Deep Learning Era
- URL: http://arxiv.org/abs/2104.10121v1
- Date: Tue, 20 Apr 2021 17:10:01 GMT
- Title: On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion
Recognition: An Update for the Deep Learning Era
- Authors: Shahin Amiriparian (1), Artem Sokolov (2,3), Ilhan Aslan (2), Lukas
Christ (1), Maurice Gerczuk (1), Tobias Hübner (1), Dmitry Lamanov (2),
Manuel Milling (1), Sandra Ottl (1), Ilya Poduremennykh (2), Evgeniy Shuranov
(2,4), Björn W. Schuller (1,5) ((1) EIHW -- Chair of Embedded Intelligence
for Health Care and Wellbeing, University of Augsburg, Germany, (2) Huawei
Technologies, (3) HSE University, Nizhniy Novgorod, Russia, (4) ITMO
University, Saint Petersburg, Russia)
- Abstract summary: We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of $73.6\,\%$ and $73.8\,\%$ on the speaker-independent development and test partitions of IEMOCAP.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text encodings from automatic speech recognition (ASR) transcripts and audio
representations have long shown promise in speech emotion recognition (SER).
Yet, it is challenging to explain the effect of each information stream on SER
systems. Further, the impact of ASR's word error rate (WER) on linguistic
emotion recognition, both on its own and in fusion with acoustic information,
requires clarification in the age of deep ASR systems. To tackle these issues,
we create transcripts
from the original speech by applying three modern ASR systems, including an
end-to-end model trained with recurrent neural network-transducer loss, a model
with connectionist temporal classification loss, and a wav2vec framework for
self-supervised learning. Afterwards, we use pre-trained textual models to
extract text representations from the ASR outputs and the gold standard. For
extraction and learning of acoustic speech features, we utilise openSMILE,
openXBoW, DeepSpectrum, and auDeep. Finally, we conduct decision-level fusion
on both information streams -- acoustics and linguistics. Using the best
development configuration, we achieve state-of-the-art unweighted average
recall values of $73.6\,\%$ and $73.8\,\%$ on the speaker-independent
development and test partitions of IEMOCAP, respectively.
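
The abstract outlines a pipeline of acoustic feature extraction, text encoding of ASR transcripts, and decision-level fusion of the two streams, evaluated with unweighted average recall (UAR). The sketch below is a minimal illustration of only the fusion and scoring steps, under explicit assumptions: the posterior matrices, the `late_fusion` helper, the equal stream weighting `alpha`, and the toy four-class labels are hypothetical and not the authors' implementation; UAR is computed as macro-averaged recall, which matches its standard definition.

```python
# A minimal sketch (not the authors' code) of decision-level fusion of an
# acoustic and a linguistic emotion classifier, scored with unweighted
# average recall (UAR). Posteriors, weighting, and labels are toy assumptions.
import numpy as np
from sklearn.metrics import recall_score


def late_fusion(p_acoustic: np.ndarray, p_linguistic: np.ndarray,
                alpha: float = 0.5) -> np.ndarray:
    """Weighted average of two (n_utterances, n_classes) posterior matrices."""
    assert p_acoustic.shape == p_linguistic.shape
    return alpha * p_acoustic + (1.0 - alpha) * p_linguistic


def unweighted_average_recall(y_true, y_pred) -> float:
    """UAR = mean of per-class recalls, i.e. macro-averaged recall."""
    return recall_score(y_true, y_pred, average="macro")


# Toy usage with random posteriors over four emotion classes
rng = np.random.default_rng(0)
n_utt, n_classes = 100, 4
p_acoustic = rng.dirichlet(np.ones(n_classes), size=n_utt)    # e.g. from audio features
p_linguistic = rng.dirichlet(np.ones(n_classes), size=n_utt)  # e.g. from ASR-transcript encodings
y_true = rng.integers(0, n_classes, size=n_utt)

fused = late_fusion(p_acoustic, p_linguistic, alpha=0.5)
print(f"UAR: {unweighted_average_recall(y_true, fused.argmax(axis=1)):.3f}")
```

The abstract states that the best configuration was selected on the speaker-independent development partition; here `alpha` merely stands in for whatever stream weighting that configuration used.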
Related papers
- Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition [110.8431434620642]
We introduce the generative speech transcription error correction (GenSEC) challenge.
This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
We discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
arXiv Detail & Related papers (2024-09-15T16:32:49Z) - Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - Leveraging Acoustic Contextual Representation by Audio-textual
Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z) - ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - End-to-end speech-to-dialog-act recognition [38.58540444573232]
We present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process.
In the proposed model, the dialog act recognition network is connected to an acoustic-to-word ASR model at its latent layer.
The entire network is fine-tuned in an end-to-end manner.
arXiv Detail & Related papers (2020-04-23T18:44:27Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.