Turbocharge Speech Understanding with Pilot Inference
- URL: http://arxiv.org/abs/2311.17065v3
- Date: Thu, 10 Oct 2024 20:04:17 GMT
- Title: Turbocharge Speech Understanding with Pilot Inference
- Authors: Rongxiang Wang, Felix Xiaozhu Lin
- Abstract summary: This paper sets out to accelerate modern speech understanding on resource-constrained edge devices.
It takes a hybrid approach: speeding up on-device execution and offloading inputs that exceed the device's capacity.
Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy, reduces end-to-end latency by 2x, and reduces offloading needs by 2x.
- Abstract: Modern speech understanding (SU) runs a sophisticated pipeline: ingesting streaming voice input, the pipeline executes encoder-decoder based deep neural networks repeatedly; by doing so, the pipeline generates tentative outputs (called hypotheses) and periodically scores them. This paper sets out to accelerate SU on resource-constrained edge devices. It takes a hybrid approach: speeding up on-device execution and offloading inputs that are beyond the device's capacity. While the approach is well-known, we address SU's unique challenges with novel techniques: (1) late contextualization, which executes a model's attentive encoder in parallel to the input ingestion; (2) pilot inference, which mitigates the SU pipeline's temporal load imbalance; (3) autoregression offramps, which evaluate offloading decisions based on pilot inferences and hypotheses. Our techniques are compatible with existing speech models, pipelines, and frameworks; they can be applied independently or in combination. Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy; it reduces the end-to-end latency by 2x and reduces the offloading needs by 2x.
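As a rough illustration of how a pilot inference can feed an autoregression offramp, here is a minimal Python sketch. The callables `run_pilot`, `run_full_decode`, and `offload`, the length-normalized confidence heuristic, and the threshold are all illustrative assumptions; they are not PASU's actual interfaces or decision rule.

```python
import math
from dataclasses import dataclass

# Hypothetical pilot-inference / offramp sketch; names and heuristics are
# placeholders, not the paper's implementation.

@dataclass
class Hypothesis:
    text: str
    log_prob: float    # total log-probability of the hypothesis
    num_tokens: int    # length, used to length-normalize the score

def pilot_confidence(hyp: Hypothesis) -> float:
    """Length-normalized confidence of a pilot hypothesis (a common heuristic)."""
    if hyp.num_tokens == 0:
        return 0.0
    return math.exp(hyp.log_prob / hyp.num_tokens)

def decode(audio_chunks, run_pilot, run_full_decode, offload, threshold=0.6):
    """Decide, per utterance, whether to finish decoding on-device or offload.

    run_pilot:       cheap decoder pass over ingested audio (pilot inference)
    run_full_decode: full on-device autoregressive decoding
    offload:         send the utterance to a server and return its transcript
    """
    buffered = []
    pilot_hyp = None
    for chunk in audio_chunks:
        buffered.append(chunk)
        # The pilot pass runs opportunistically while audio is still arriving,
        # so the offload decision is ready as soon as ingestion ends.
        pilot_hyp = run_pilot(buffered)

    # Autoregression "offramp": if the pilot hypothesis looks reliable enough,
    # stay on-device; otherwise offload the hard input.
    if pilot_hyp is not None and pilot_confidence(pilot_hyp) >= threshold:
        return run_full_decode(buffered, pilot_hyp)
    return offload(buffered)
```

The key property the sketch tries to capture is that the pilot pass overlaps with input ingestion, so evaluating the offloading decision adds little to the end-to-end latency.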
Related papers
- BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
BitPipe is a bidirectional interleaved pipeline-parallelism scheme for accelerating the training of large models.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
PipeInfer is a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios.
PipeInfer exhibits up to a 2.15x improvement in generation speed over standard speculative inference.
arXiv Detail & Related papers (2024-07-16T14:52:02Z)
- Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
We introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method for text-to-image (T2I) diffusion models.
APTP learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts.
APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores.
arXiv Detail & Related papers (2024-06-17T19:22:04Z)
- ALTO: An Efficient Network Orchestrator for Compound AI Systems
ALTO is a network orchestrator for efficiently serving compound AI systems such as pipelines of language models.
As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible (a minimal streaming sketch appears after this list).
We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances.
arXiv Detail & Related papers (2024-03-07T08:30:26Z)
- AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training
We propose AccEPT, an acceleration scheme for edge collaborative pipeline-parallel training.
In particular, we propose a lightweight adaptive latency predictor to accurately estimate the latency of each layer on different devices.
Our numerical results demonstrate that the proposed approach speeds up edge pipeline-parallel training by up to 3x.
arXiv Detail & Related papers (2023-11-10T02:18:33Z)
- GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition
Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines.
We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam decoder compatible with current CTC models.
It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition.
arXiv Detail & Related papers (2023-11-08T19:57:10Z)
- Decoder Tuning: Efficient Language Understanding as Decoding
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
With gradient-based optimization, DecT can be trained within several seconds and requires only one query to the pre-trained model per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200x speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
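Below is a minimal, hypothetical sketch of the token-streaming idea mentioned in the ALTO entry above: a downstream stage begins consuming tokens as soon as an upstream stage emits them, rather than waiting for the complete output. The stage functions, sentinel, and threading setup are illustrative assumptions, not ALTO's actual interfaces.

```python
import queue
import threading

# Illustrative two-stage pipeline with token-level streaming between stages.
EOS = None  # sentinel marking end of a token stream

def stage_one(prompt, out_q):
    """Producer stage: emit tokens one by one as they are 'generated'."""
    for token in prompt.split():       # stand-in for token-by-token decoding
        out_q.put(token.upper())
    out_q.put(EOS)

def stage_two(in_q, results):
    """Consumer stage: start processing each token as soon as it arrives,
    instead of waiting for the full upstream output."""
    while True:
        token = in_q.get()
        if token is EOS:
            break
        results.append(f"<{token}>")   # stand-in for downstream processing

if __name__ == "__main__":
    q, results = queue.Queue(), []
    consumer = threading.Thread(target=stage_two, args=(q, results))
    consumer.start()
    stage_one("streaming tokens between pipeline stages", q)
    consumer.join()
    print(" ".join(results))
```

In a real deployment the same pattern spans processes or machines, which is where the correctness and load-balancing challenges highlighted in the ALTO entry arise.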