Efficient Incremental Text-to-Speech on GPUs
- URL: http://arxiv.org/abs/2211.13939v1
- Date: Fri, 25 Nov 2022 07:43:45 GMT
- Title: Efficient Incremental Text-to-Speech on GPUs
- Authors: Muyang Du, Chuan Liu, Jiaxing Qi, Junjie Lai
- Abstract summary: We present a highly efficient approach to perform real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching.
Experimental results demonstrate that the proposed method is capable of producing high-quality speech with a first-chunk latency lower than 80ms under 100 QPS on a single NVIDIA A10 GPU.
- Score: 1.35346836945515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incremental text-to-speech, also known as streaming TTS, has been
increasingly applied to online speech applications that require ultra-low
response latency to provide an optimal user experience. However, most of the
existing speech synthesis pipelines deployed on GPU are still non-incremental,
which uncovers limitations in high-concurrency scenarios, especially when the
pipeline is built with end-to-end neural network models. To address this issue,
we present a highly efficient approach to perform real-time incremental TTS on
GPUs with Instant Request Pooling and Module-wise Dynamic Batching.
Experimental results demonstrate that the proposed method is capable of
producing high-quality speech with a first-chunk latency lower than 80ms under
100 QPS on a single NVIDIA A10 GPU and significantly outperforms the
non-incremental twin in both concurrency and latency. Our work reveals the
effectiveness of high-performance incremental TTS on GPUs.
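The two serving ideas named in the abstract — pooling requests the instant they arrive and re-batching at each module boundary — can be sketched as a toy scheduler. This is a minimal illustrative sketch, not the paper's implementation: the module names, per-chunk accounting, and batch-size limit are all assumptions made for the example.

```python
import collections
import itertools

class Request:
    """One streaming TTS request; audio is emitted chunk by chunk."""
    _ids = itertools.count()

    def __init__(self, text, n_chunks):
        self.id = next(self._ids)
        self.text = text
        self.chunks_left = n_chunks   # audio chunks still to synthesize
        self.outputs = []             # chunks produced so far

def acoustic_model(batch):
    # Stand-in for a batched text-to-spectrogram module.
    return [f"mel({r.text}#{len(r.outputs)})" for r in batch]

def vocoder(batch, mels):
    # Stand-in for a batched spectrogram-to-waveform module.
    return [f"audio[{m}]" for m in mels]

def run_scheduler(arrivals, max_batch=8):
    """arrivals: dict step -> list of Requests arriving at that step.

    New requests join the pool immediately (instant request pooling),
    and each module runs on whatever requests are currently pending
    (module-wise dynamic batching), one chunk per pass.
    """
    pool = collections.deque()
    done = []
    step = 0
    while pool or any(s >= step for s in arrivals):
        pool.extend(arrivals.pop(step, []))   # requests join mid-flight
        batch = [pool.popleft() for _ in range(min(max_batch, len(pool)))]
        if batch:
            mels = acoustic_model(batch)      # dynamic batch, module 1
            audio = vocoder(batch, mels)      # dynamic batch, module 2
            for req, chunk in zip(batch, audio):
                req.outputs.append(chunk)
                req.chunks_left -= 1
                (done if req.chunks_left == 0 else pool).append(req)
        step += 1
    return done
```

Because batching happens per module pass rather than per request, a request arriving one step late still shares the next module invocation with in-flight requests instead of waiting for a fresh pipeline run.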
Related papers
- Fastrack: Fast IO for Secure ML using GPU TEEs [7.758531952461963]
GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions.
CPU-to-GPU communication overheads significantly hinder performance.
This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads.
We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission.
arXiv Detail & Related papers (2024-10-20T01:00:33Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- SPEED: Streaming Partition and Parallel Acceleration for Temporal Interaction Graph Embedding [22.68416593780539]
We introduce a novel training approach namely Streaming Edge Partitioning and Parallel Acceleration for Temporal Interaction Graph Embedding.
Our method can achieve a good balance in computing resources, computing time, and downstream task performance.
Empirical validation across 7 real-world datasets demonstrates the potential to expedite training speeds by a factor of up to 19.29x.
arXiv Detail & Related papers (2023-08-27T15:11:44Z)
- DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation [7.3619135783046]
We present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model end-to-end with low latency and high throughput.
We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources.
arXiv Detail & Related papers (2022-09-22T05:59:59Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves the global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from the pain of huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD)
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
- AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning [1.5301777464637454]
AxoNN is a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU.
By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times.
arXiv Detail & Related papers (2021-10-25T14:43:36Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm- hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [46.20949184826173]
This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms.
Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency.
arXiv Detail & Related papers (2020-03-30T14:16:23Z)
- Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines: (i) on the CPU, a very fast optical flow method that exploits the temporal aspect of the video and propagates semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.