End-to-end speech-to-dialog-act recognition
- URL: http://arxiv.org/abs/2004.11419v2
- Date: Tue, 28 Jul 2020 22:12:17 GMT
- Title: End-to-end speech-to-dialog-act recognition
- Authors: Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya
Kawahara
- Abstract summary: We present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process.
In the proposed model, the dialog act recognition network is conjunct with an acoustic-to-word ASR model at its latent layer.
The entire network is fine-tuned in an end-to-end manner.
- Score: 38.58540444573232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding, which extracts intents and/or semantic
concepts in utterances, is conventionally formulated as a post-processing of
automatic speech recognition. It is usually trained with oracle transcripts,
but needs to deal with errors by ASR. Moreover, there are acoustic features
which are related with intents but not represented with the transcripts. In
this paper, we present an end-to-end model which directly converts speech into
dialog acts without the deterministic transcription process. In the proposed
model, the dialog act recognition network is conjunct with an acoustic-to-word
ASR model at its latent layer before the softmax layer, which provides a
distributed representation of word-level ASR decoding information. Then, the
entire network is fine-tuned in an end-to-end manner. This allows for stable
training as well as robustness against ASR errors. The model is further
extended to conduct DA segmentation jointly. Evaluations with the Switchboard
corpus demonstrate that the proposed method significantly improves dialog act
recognition accuracy from the conventional pipeline framework.
Related papers
- Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning [6.363223418619587]
We introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy.
Based on the evaluation of speech dialogues, our method shows superior results compared to baselines.
arXiv Detail & Related papers (2024-08-12T10:21:09Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Bypass Temporal Classification: Weakly Supervised Automatic Speech
Recognition with Imperfect Transcripts [44.16141704545044]
We present a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data.
The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora.
arXiv Detail & Related papers (2023-06-01T14:56:19Z) - Leveraging Acoustic Contextual Representation by Audio-textual
Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what"
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion
Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of $73.6,%$ and $73.8,%$ on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z) - Streaming Multi-talker Speech Recognition with Joint Speaker
Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the Librispeech dataset -- a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.