Leveraging Acoustic Contextual Representation by Audio-textual
Cross-modal Learning for Conversational ASR
- URL: http://arxiv.org/abs/2207.01039v1
- Date: Sun, 3 Jul 2022 13:32:24 GMT
- Title: Leveraging Acoustic Contextual Representation by Audio-textual
Cross-modal Learning for Conversational ASR
- Authors: Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma
- Abstract summary: We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
- Score: 25.75615870266786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging context information is an intuitive way to improve performance on
conversational automatic speech recognition (ASR). Previous works usually adopt the
recognized hypotheses of historical utterances as preceding context, which may bias the
current hypothesis due to inevitable recognition errors in that history. To avoid this
problem, we propose an audio-textual cross-modal representation extractor that learns
contextual representations directly from preceding speech. Specifically, it consists of
two modal-related encoders, which extract high-level latent features from speech and the
corresponding text, and a cross-modal encoder, which learns the correlation between
speech and text. We randomly mask some input tokens and input sequences of each modality,
and the cross-modal encoder is then trained to predict the missing tokens or the missing
modality with a modal-level CTC loss. The model thus captures not only the bi-directional
context dependencies within each modality but also the relationships between the
modalities. During training of the conversational ASR system, the extractor is frozen and
used to extract the textual representation of the preceding speech, and this
representation is fed to the ASR decoder as context through an attention mechanism. The
effectiveness of the proposed approach is validated on several Mandarin conversation
corpora, with a character error rate (CER) reduction of up to 16% achieved on the
MagicData dataset.
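Below is a minimal, hypothetical sketch (PyTorch) of the architecture the abstract describes: two modal-related encoders plus a cross-modal encoder pre-trained with random masking and a modal-level CTC objective, and a decoder layer that attends over the frozen extractor's output as conversational context. All module names, layer counts, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the paper's two-stage idea; hyperparameters and
# module layout are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalExtractor(nn.Module):
    """Two modal-related encoders plus a cross-modal encoder, pre-trained with
    random token/modality masking and a modal-level CTC loss (per the abstract)."""

    def __init__(self, d_model=256, vocab_size=5000, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(layer, n_layers)  # acoustic branch
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(layer, n_layers)    # textual branch
        self.cross_encoder = nn.TransformerEncoder(layer, n_layers)   # joint branch
        self.ctc_head = nn.Linear(d_model, vocab_size)                # CTC prediction head

    @staticmethod
    def _mask(x, p):
        # Zero out random positions so the cross-modal encoder must recover them.
        keep = (torch.rand(x.shape[:2], device=x.device) > p).unsqueeze(-1)
        return x * keep

    def forward(self, speech_feats, text_tokens, mask_prob=0.15):
        s = self.speech_encoder(self._mask(speech_feats, mask_prob))
        t = self.text_encoder(self._mask(self.text_embed(text_tokens), mask_prob))
        joint = self.cross_encoder(torch.cat([s, t], dim=1))
        return joint, self.ctc_head(joint).log_softmax(dim=-1)

# Pre-training step (shapes illustrative): the log-probabilities feed a CTC loss
# whose targets are the tokens of the masked/missing modality.
#   joint, log_probs = extractor(speech_feats, text_tokens)
#   loss = nn.CTCLoss()(log_probs.transpose(0, 1), targets, input_lens, target_lens)


class ContextualDecoderLayer(nn.Module):
    """ASR decoder layer with an extra attention over the contextual representation
    of preceding speech produced by the frozen extractor."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, y, enc_out, context):
        # Causal mask so each target position only attends to previous tokens.
        causal = torch.triu(torch.full((y.size(1), y.size(1)), float("-inf"),
                                       device=y.device), diagonal=1)
        y = y + self.self_attn(y, y, y, attn_mask=causal)[0]
        y = y + self.src_attn(y, enc_out, enc_out)[0]   # current utterance's encoder output
        y = y + self.ctx_attn(y, context, context)[0]   # frozen preceding-speech context
        return y + self.ffn(y)
```

In an actual system, the extractor would be pre-trained first and then kept frozen while the ASR encoder-decoder is trained with the additional context attention.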
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning [6.363223418619587]
We introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy.
Based on the evaluation of speech dialogues, our method shows superior results compared to baselines.
arXiv Detail & Related papers (2024-08-12T10:21:09Z) - VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation [27.926862030684926]
We introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation.
Our approach combines pre-trained speech and text models through a specialized encoder and a modal-level mask input.
By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss.
arXiv Detail & Related papers (2023-10-22T11:57:33Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - Towards Relation Extraction From Speech [56.36416922396724]
We propose a new listening information extraction task, i.e., speech relation extraction.
We construct the training dataset for speech relation extraction via text-to-speech systems, and we construct the testing dataset via crowd-sourcing with native English speakers.
We conduct comprehensive experiments to distinguish the challenges in speech relation extraction, which may shed light on future explorations.
arXiv Detail & Related papers (2022-10-17T05:53:49Z) - Conversational Speech Recognition By Learning Conversation-level
Characteristics [25.75615870266786]
This paper proposes a conversational ASR model which explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework.
Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves a maximum 12% relative character error rate (CER) reduction.
arXiv Detail & Related papers (2022-02-16T04:33:05Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion
Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z) - End-to-end speech-to-dialog-act recognition [38.58540444573232]
We present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process.
In the proposed model, the dialog act recognition network is conjunct with an acoustic-to-word ASR model at its latent layer.
The entire network is fine-tuned in an end-to-end manner.
arXiv Detail & Related papers (2020-04-23T18:44:27Z)