Related papers: PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

URL: http://arxiv.org/abs/2406.07823v1
Date: Wed, 12 Jun 2024 02:46:17 GMT
Title: PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding
Authors: Trang Le, Daniel Lazar, Suyoun Kim, Shan Jiang, Duc Le, Adithya Sagar, Aleksandr Livshits, Ahmed Aly, Akshat Shrivastava,
Abstract summary: PRoDeliberation is a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions.
Score: 44.77985942208969
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.

Related papers

AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation.<n>AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference.<n>CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences.<n>We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations [26.938560887095658]
Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss.<n>We propose QTTS, a novel TTS framework built upon our new audio, QDAC.<n>Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baseline.
arXiv Detail & Related papers (2025-07-16T12:47:09Z)
Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding [21.682444278458433]
Speech Speculative Decoding (SSD) is a novel framework for autoregressive speech synthesis acceleration.<n>SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding.
arXiv Detail & Related papers (2025-05-21T11:17:04Z)
Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS) MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols. By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
Speech recognition for air traffic control via feature learning and end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems. The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss. Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
arXiv Detail & Related papers (2021-11-04T06:38:21Z)
ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems [15.527854608553824]
ATCSpeechNet is proposed to tackle the issue of translating communication speech into human-readable text in air traffic control systems. An end-to-end paradigm is developed to convert speech waveform into text directly, without any feature engineering or lexicon. Experimental results on the ATCSpeech corpus demonstrate that the proposed approach achieves a high performance with a very small labeled corpus.
arXiv Detail & Related papers (2021-02-17T02:27:09Z)
Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU) We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.