Personalized Predictive ASR for Latency Reduction in Voice Assistants
- URL: http://arxiv.org/abs/2305.13794v1
- Date: Tue, 23 May 2023 08:05:43 GMT
- Title: Personalized Predictive ASR for Latency Reduction in Voice Assistants
- Authors: Andreas Schwarz, Di He, Maarten Van Segbroeck, Mohammed Hethnawi, Ariya Rastrow
- Abstract summary: We introduce predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance.
We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.
- Score: 29.237198363254752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize
prefetching to partially hide the latency of response generation. Prefetching
involves passing a preliminary ASR hypothesis to downstream systems in order to
prefetch and cache a response. If the final ASR hypothesis after endpoint
detection matches the preliminary one, the cached response can be delivered to
the user, thus saving latency. In this paper, we extend this idea by
introducing predictive automatic speech recognition, where we predict the full
utterance from a partially observed utterance, and prefetch the response based
on the predicted utterance. We introduce two personalization approaches and
investigate the tradeoff between potential latency gains from successful
predictions and the cost increase from failed predictions. We evaluate our
methods on an internal voice assistant dataset as well as the public SLURP
dataset.
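The prefetching flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PrefetchCache`, `predict_full_utterance`, and the history-based prefix lookup are hypothetical names and a stand-in technique chosen for brevity, since the abstract does not specify the two personalization approaches. The `backend_calls` counter is an assumed proxy for the cost increase caused by failed predictions.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Tuple


class PrefetchCache:
    """Hypothetical cache of downstream responses keyed by utterance text."""

    def __init__(self, generate_response: Callable[[str], str]) -> None:
        self._generate = generate_response
        self._cache: Dict[str, str] = {}
        self.backend_calls = 0  # assumed proxy for downstream cost

    def _call(self, text: str) -> str:
        self.backend_calls += 1
        return self._generate(text)

    def prefetch(self, hypothesis: str) -> None:
        # Generate and cache a response before the utterance has ended.
        if hypothesis not in self._cache:
            self._cache[hypothesis] = self._call(hypothesis)

    def respond(self, final_hypothesis: str) -> Tuple[str, bool]:
        # After endpoint detection: serve the cached response on a match,
        # otherwise fall back to a fresh (slower) downstream call.
        cached = self._cache.get(final_hypothesis)
        if cached is not None:
            return cached, True
        return self._call(final_hypothesis), False


def predict_full_utterance(partial: str, history: Counter) -> Optional[str]:
    # Stand-in personalization: pick the user's most frequent past utterance
    # that starts with the partially observed text.
    matches = [utt for utt in history if utt.startswith(partial)]
    if not matches:
        return None
    return max(matches, key=lambda utt: history[utt])


# Usage with made-up data: the user has said only "turn on the" so far.
history = Counter({"turn on the kitchen lights": 5, "turn on the tv": 2})
cache = PrefetchCache(lambda text: f"response to: {text}")

predicted = predict_full_utterance("turn on the", history)
if predicted is not None:
    cache.prefetch(predicted)

response, hit = cache.respond("turn on the kitchen lights")
```

When the final hypothesis matches the prediction, the response comes from the cache and no additional downstream call is made. When it does not match, the prefetch call is wasted and a second call is needed, which is the cost side of the tradeoff the abstract mentions.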
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Mitigating LLM Hallucinations via Conformal Abstention [70.83870602967625]
We develop a principled procedure for determining when a large language model should abstain from responding in a general domain.
We leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate).
Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets.
arXiv Detail & Related papers (2024-04-04T11:32:03Z) - Towards Reliable and Factual Response Generation: Detecting Unanswerable
Questions in Information-Seeking Conversations [16.99952884041096]
Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems.
We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response.
Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate.
arXiv Detail & Related papers (2024-01-21T10:15:36Z) - Non Intrusive Intelligibility Predictor for Hearing Impaired Individuals
using Self Supervised Speech Representations [21.237026538221404]
Techniques for non-intrusive prediction of speech quality (SQ) ratings are extended to the prediction of intelligibility for hearing-impaired users.
It is found that self-supervised representations are useful as input features to non-intrusive prediction models.
arXiv Detail & Related papers (2023-07-25T11:42:52Z) - Using External Off-Policy Speech-To-Text Mappings in Contextual
End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z) - Towards Improved Room Impulse Response Estimation for Speech Recognition [53.04440557465013]
We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of far-field automatic speech recognition (ASR).
We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators.
We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features.
arXiv Detail & Related papers (2022-11-08T00:40:27Z) - An Experimental Study on Private Aggregation of Teacher Ensemble
Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform one of the very first experimental studies on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z) - Real-time Caller Intent Detection In Human-Human Customer Support Spoken
Conversations [10.312382727352823]
Agent assistance during human-human customer support spoken interactions requires triggering based on the caller's intent (reason for call).
The goal is for a system to detect the caller's intent at the time the agent would have been able to detect it (Intent Boundary).
Recent work on voice assistants has used incremental real-time predictions at a word-by-word level to detect intent before the end of a command.
arXiv Detail & Related papers (2022-08-14T07:50:23Z) - Progressive End-to-End Object Detection in Crowded Scenes [96.92416613336096]
Previous query-based detectors suffer from two drawbacks: first, multiple predictions will be inferred for a single object, typically in crowded scenes; second, the performance saturates as the depth of the decoding stage increases.
We propose a progressive predicting method to address the above issues. Specifically, we first select accepted queries to generate true positive predictions, then refine the remaining noisy queries according to the previously accepted predictions.
Experiments show that our method can significantly boost the performance of query-based detectors in crowded scenes.
arXiv Detail & Related papers (2022-03-15T06:12:00Z) - Representation Learning for Sequence Data with Deep Autoencoding
Predictive Components [96.42805872177067]
We propose a self-supervised representation learning method for sequence data, based on the intuition that useful representations of sequence data should exhibit a simple structure in the latent space.
We encourage this latent structure by maximizing an estimate of predictive information of latent feature sequences, which is the mutual information between past and future windows at each time step.
We demonstrate that our method recovers the latent space of noisy dynamical systems, extracts predictive features for forecasting tasks, and improves automatic speech recognition when used to pretrain the encoder on large amounts of unlabeled data.
arXiv Detail & Related papers (2020-10-07T03:34:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.