Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2401.02417v1
- Date: Thu, 4 Jan 2024 18:59:31 GMT
- Title: Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic
Speech Recognition
- Authors: David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn
Hoffmeister
- Abstract summary: We introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion.
We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new large-scale semi-synthetic meta-dataset of audio task-oriented dialogues.
These gains transfer to real-world systems as well, where we show that CLC can help to improve performance by up to 6.7% over baselines.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While word error rates of automatic speech recognition (ASR) systems have
consistently fallen, natural language understanding (NLU) applications built on
top of ASR systems still attribute significant numbers of failures to
low-quality speech recognition results. Existing assistant systems collect
large numbers of these unsuccessful interactions, but these systems usually
fail to learn from these interactions, even in an offline fashion. In this
work, we introduce CLC: Contrastive Learning for Conversations, a family of
methods for contrastive fine-tuning of models in a self-supervised fashion,
making use of easily detectable artifacts in unsuccessful conversations with
assistants. We demonstrate that our CLC family of approaches can improve the
performance of ASR models on OD3, a new public large-scale semi-synthetic
meta-dataset of audio task-oriented dialogues, by up to 19.2%. These gains
transfer to real-world systems as well, where we show that CLC can help to
improve performance by up to 6.7% over baselines. We make OD3 publicly
available at https://github.com/amazon-science/amazon-od3 .
Related papers
- CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models [23.278483193586887]
We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM.
arXiv Detail & Related papers (2025-05-31T07:26:44Z) - Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction [110.38946048535033]
This paper introduces Step-Audio, the first production-ready open-source solution for speech recognition.
Key contributions include: 1) a unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex
arXiv Detail & Related papers (2025-02-17T15:58:56Z) - Unifying Global and Near-Context Biasing in a Single Trie Pass [11.277273712268897]
We propose a previously unexplored combination of an NE bias list and a word-level n-gram language model (LM). We show that the proposed combination of keyword biasing and n-gram LM improves entity recognition by up to 32% relative and reduces overall WER by up to 12% relative.
arXiv Detail & Related papers (2024-09-20T13:53:37Z) - Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z) - A Multimodal Approach to Device-Directed Speech Detection with Large Language Models [41.37311266840156]
We explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase.
We train classifiers using only acoustic information obtained from the audio waveform.
We take the decoder outputs of an automatic speech recognition system, such as 1-best hypotheses, as input features to a large language model.
arXiv Detail & Related papers (2024-03-21T14:44:03Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution combines Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) with Deep Neural Network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Contrastive Learning for Improving ASR Robustness in Spoken Language
Understanding [28.441725610692714]
This paper focuses on learning utterance representations that are robust to ASR errors using a contrastive objective.
Experiments on three benchmark datasets demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2022-05-02T07:21:21Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command
Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z) - Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy that smooths the two dialogue states in a single decoder and bridges decision making and question generation.
Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z) - Multi-task Language Modeling for Improving Speech Recognition of Rare
Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with a multi-task LM shows a 4.6% WERR compared with an RNN-Transducer-only ASR baseline on rare-word recognition.
arXiv Detail & Related papers (2020-11-23T20:40:44Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.