Related papers: Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

URL: http://arxiv.org/abs/2401.02417v1
Date: Thu, 4 Jan 2024 18:59:31 GMT
Title: Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition
Authors: David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Bj\"orn Hoffmeister
Abstract summary: We introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion. We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new large-scale semi-synthetic meta-dataset of audio task-oriented dialogues. These gains transfer to real-world systems as well, where we show that CLC can help to improve performance by up to 6.7% over baselines.
Score: 19.475314134504504
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While word error rates of automatic speech recognition (ASR) systems have consistently fallen, natural language understanding (NLU) applications built on top of ASR systems still attribute significant numbers of failures to low-quality speech recognition results. Existing assistant systems collect large numbers of these unsuccessful interactions, but these systems usually fail to learn from these interactions, even in an offline fashion. In this work, we introduce CLC: Contrastive Learning for Conversations, a family of methods for contrastive fine-tuning of models in a self-supervised fashion, making use of easily detectable artifacts in unsuccessful conversations with assistants. We demonstrate that our CLC family of approaches can improve the performance of ASR models on OD3, a new public large-scale semi-synthetic meta-dataset of audio task-oriented dialogues, by up to 19.2%. These gains transfer to real-world systems as well, where we show that CLC can help to improve performance by up to 6.7% over baselines. We make OD3 publicly available at https://github.com/amazon-science/amazon-od3 .

Related papers

CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models [23.278483193586887]
We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task.<n>Our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM.
arXiv Detail & Related papers (2025-05-31T07:26:44Z)
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction [110.38946048535033]
This paper introduces Step-Audio, the first production-ready open-source solution for speech recognition. Key contributions include: 1) a unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex
arXiv Detail & Related papers (2025-02-17T15:58:56Z)
Unifying Global and Near-Context Biasing in a Single Trie Pass [11.277273712268897]
We propose an unexplored combination of an NE bias list and a word-level n-gram language model (LM)<n>We show that the proposed combination of keyword biasing and n-gram LM improves entity recognition by up to 32% relative and reduces overall WER by up to a 12% relative.
arXiv Detail & Related papers (2024-09-20T13:53:37Z)
Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs) To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods. Our results indicate that this can improve the performance of less competitive ASR systems.
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models [41.37311266840156]
We explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We train classifiers using only acoustic information obtained from the audio waveform. We take the decoder outputs of an automatic speech recognition system, such as 1-best hypotheses, as input features to a large language model.
arXiv Detail & Related papers (2024-03-21T14:44:03Z)
MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders. Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing. Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy. We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [28.441725610692714]
This paper focuses on learning utterance representations that are robust to ASR errors using a contrastive objective. Experiments on three benchmark datasets demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2022-05-02T07:21:21Z)
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR) It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy by smoothing the two dialogue states in only one decoder and bridge decision making and question generation. Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z)
Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance. Our best ASR system with multi-task LM shows 4.6% WERR deduction compared with RNN Transducer only ASR baseline for rare words recognition.
arXiv Detail & Related papers (2020-11-23T20:40:44Z)
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU) We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.