Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?
- URL: http://arxiv.org/abs/2111.14842v1
- Date: Mon, 29 Nov 2021 15:13:36 GMT
- Title: Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?
- Authors: Lasse Borgholt, Jakob Drachmann Havtorn, Mostafa Abdou, Joakim Edin,
Lars Maaløe, Anders Søgaard, Christian Igel
- Abstract summary: We show that learned speech features are superior to ASR transcripts on three classification tasks.
We highlight the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words as key to better performance.
- Score: 14.575551366682872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding (SLU) tasks are usually solved by first
transcribing an utterance with automatic speech recognition (ASR) and then
feeding the output to a text-based model. Recent advances in self-supervised
representation learning for speech data have focused on improving the ASR
component. We investigate whether representation learning for speech has
matured enough to replace ASR in SLU. We compare learned speech features from
wav2vec 2.0, state-of-the-art ASR transcripts, and the ground truth text as
input for a novel speech-based named entity recognition task, a cardiac arrest
detection task on real-world emergency calls and two existing SLU benchmarks.
We show that learned speech features are superior to ASR transcripts on three
classification tasks. For machine translation, ASR transcripts are still the
better choice. We highlight the intrinsic robustness of wav2vec 2.0
representations to out-of-vocabulary words as key to better performance.
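As a minimal sketch of the two pipelines under comparison, the snippet below contrasts the cascade (ASR transcript fed to a text model) with feeding learned wav2vec 2.0 features directly to a task head. It uses the public Hugging Face `facebook/wav2vec2-base-960h` checkpoint as a stand-in for the paper's models; the mean pooling and the linear head are illustrative assumptions rather than the paper's exact setup.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Model, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
audio = torch.randn(16000).numpy()  # stand-in for 1 s of 16 kHz speech
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Pipeline A (cascade): transcribe with ASR, then hand the text to a
# text-based SLU model (e.g. a BERT classifier, not shown).
asr = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
with torch.no_grad():
    logits = asr(inputs.input_values).logits
transcript = processor.batch_decode(logits.argmax(dim=-1))[0]

# Pipeline B (direct): feed learned speech features to a task head.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
with torch.no_grad():
    features = encoder(inputs.input_values).last_hidden_state  # (1, T, 768)
head = torch.nn.Linear(768, 2)  # hypothetical classification head
prediction = head(features.mean(dim=1)).argmax(dim=-1)
```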
Related papers
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
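The CTAP summary above rests on a contrastive objective that pulls paired phoneme and speech embeddings together in a joint space. A minimal sketch of such a loss follows; the symmetric InfoNCE form, the temperature, and the encoders producing the embeddings are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ctap_style_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings: row i of each tensor
    is a matched speech/phoneme pair, all other rows act as negatives."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    phoneme_emb = F.normalize(phoneme_emb, dim=-1)
    logits = speech_emb @ phoneme_emb.T / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Dummy outputs of the two (assumed) encoders for a batch of 8 pairs.
loss = ctap_style_loss(torch.randn(8, 256), torch.randn(8, 256))
```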
- Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature.
arXiv Detail & Related papers (2023-03-03T01:57:16Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- A Textless Metric for Speech-to-Speech Comparison [20.658229254191266]
We introduce a new and simple method for comparing speech utterances without relying on text transcripts.
Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units.
arXiv Detail & Related papers (2022-10-21T09:28:54Z)
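A rough sketch of the textless comparison described above: HuBERT frames are quantized into discrete units against a codebook, and two utterances are scored by normalized edit distance over those units. The random codebook (standing in for a learned k-means quantizer) and the choice of edit distance are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import HubertModel

encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def to_units(waveform, codebook):
    """Map a (1, samples) waveform to a sequence of discrete unit ids."""
    with torch.no_grad():
        frames = encoder(waveform).last_hidden_state.squeeze(0)  # (T, 768)
    # Nearest-centroid quantization against a codebook of shape (K, 768).
    return torch.cdist(frames, codebook).argmin(dim=-1).tolist()

def edit_distance(a, b):
    """Plain Levenshtein distance between two unit sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

codebook = torch.randn(100, 768)  # stand-in for a learned k-means codebook
u1 = to_units(torch.randn(1, 16000), codebook)
u2 = to_units(torch.randn(1, 16000), codebook)
score = edit_distance(u1, u2) / max(len(u1), len(u2))
```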
- WaBERT: A Low-resource End-to-end Model for Spoken Language Understanding and Speech-to-BERT Alignment [2.7505260301752763]
We propose a novel end-to-end model combining the speech model and the language model for SLU tasks.
WaBERT is based on the pre-trained speech and language model, hence training from scratch is not needed.
arXiv Detail & Related papers (2022-04-22T02:14:40Z)
- AISHELL-NER: Named Entity Recognition from Chinese Speech [54.434118596263126]
We introduce a new dataset, AISHELL-NER, for NER from Chinese speech.
The results demonstrate that the performance could be improved by combining entity-aware ASR and a pretrained NER tagger.
arXiv Detail & Related papers (2022-02-17T09:18:48Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
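The wav2vec-Switch summary above describes a concrete mechanism: compute the usual contrastive loss for each view of an original-noisy pair, then also swap the quantized targets across the pair. A heavily simplified sketch follows, with a hypothetical `encode` function standing in for wav2vec 2.0; masking, sampled negatives, and the codebook diversity loss are omitted.

```python
import torch
import torch.nn.functional as F

def encode(waveform):
    # Hypothetical stand-in for wav2vec 2.0: returns contextualized and
    # quantized frame representations, each of shape (T, D).
    return torch.randn(50, 256), torch.randn(50, 256)

def contrast(context, targets, temperature=0.1):
    # Each context frame should match its aligned target frame; the other
    # frames in the utterance serve as negatives.
    sim = F.cosine_similarity(context.unsqueeze(1),
                              targets.unsqueeze(0), dim=-1)  # (T, T)
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim / temperature, labels)

clean, noisy = torch.randn(16000), torch.randn(16000)
c_clean, q_clean = encode(clean)
c_noisy, q_noisy = encode(noisy)

# Standard contrastive task: each view predicts its own quantized targets.
loss = contrast(c_clean, q_clean) + contrast(c_noisy, q_noisy)
# Switched task: swap the quantized targets across the clean/noisy pair,
# pushing the model toward noise-invariant representations.
loss = loss + contrast(c_clean, q_noisy) + contrast(c_noisy, q_clean)
```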
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.