Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud
to Edge
- URL: http://arxiv.org/abs/2203.14416v1
- Date: Sun, 27 Mar 2022 23:56:52 GMT
- Authors: Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin
Osipov, June Sig Sung
- Abstract summary: Bunched LPCNet2 provides efficient performance: high quality for cloud servers and low complexity for low-resource edge devices.
Experiments demonstrate that Bunched LPCNet2 generates satisfactory speech quality with a model footprint of 1.1 MB while operating faster than real time on an RPi 3B.
- Score: 3.612475016403612
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-Speech (TTS) services that run on edge devices have many advantages
over cloud TTS, e.g., lower latency and fewer privacy concerns. However, neural
vocoders with low complexity and a small model footprint inevitably generate
annoying sounds. This study proposes Bunched LPCNet2, an improved LPCNet
architecture that provides highly efficient performance: high quality for
cloud servers and low complexity for low-resource edge devices. A single
logistic distribution achieves computational efficiency, and insightful tricks
reduce the model footprint while maintaining speech quality. A DualRate
architecture, which generates a lower sampling rate from a prosody model, is
also proposed to reduce maintenance costs. The experiments demonstrate that
Bunched LPCNet2 generates satisfactory speech quality with a model footprint of
1.1 MB while operating faster than real time on an RPi 3B. Our audio samples are
available at https://srtts.github.io/bunchedLPCNet2.
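As a rough illustration of the ingredients named above, the sketch below pairs a short-term linear predictor with a stand-in neural excitation model whose residual is drawn from a single logistic distribution, emitting a bunch of samples per network step. Everything here (`residual_net`, the coefficients, the `bunch` size) is a toy assumption for intuition, not the authors' implementation:

```python
import numpy as np

def lpc_predict(history, coeffs):
    """Linear-prediction half of an LPCNet-style vocoder: predict the next
    sample as a weighted sum of the previous len(coeffs) samples."""
    return float(np.dot(coeffs, history[::-1]))

def sample_single_logistic(mu, s, rng):
    """Draw one residual sample from a single logistic distribution
    (location mu, scale s) via inverse-CDF sampling."""
    u = rng.uniform(1e-7, 1.0 - 1e-7)
    return mu + s * np.log(u / (1.0 - u))

def synthesize(n_samples, coeffs, residual_net, rng, bunch=2):
    """Toy synthesis loop: each 'network step' emits `bunch` samples,
    so the expensive residual model runs n_samples / bunch times."""
    order = len(coeffs)
    out = list(np.zeros(order))  # zero-padded history
    for _ in range(0, n_samples, bunch):
        # One (stand-in) network evaluation parameterizes a bunch of residuals.
        mu, s = residual_net(np.array(out[-order:]))
        for _ in range(bunch):
            pred = lpc_predict(np.array(out[-order:]), coeffs)
            out.append(pred + sample_single_logistic(mu, s, rng))
    return np.array(out[order:order + n_samples])

# Stand-in for the neural excitation model: returns (mu, scale).
toy_net = lambda h: (0.0, 0.01)
audio = synthesize(160, np.array([0.5, 0.3, 0.1]), toy_net,
                   np.random.default_rng(0))
```

With `bunch=2`, the stand-in network runs half as often as samples are emitted, which is the source of the bunching speedup.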
Related papers
- A Real-Time Voice Activity Detection Based On Lightweight Neural Network [4.589472292598182]
Voice activity detection (VAD) is the task of detecting speech in an audio stream.
Recent neural network-based VADs have alleviated the degradation of performance to some extent.
We propose a lightweight and real-time neural network called MagicNet, which utilizes causal and depthwise-separable 1-D convolutions and a GRU.
arXiv Detail & Related papers (2024-05-27T03:31:16Z) - sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection
with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
arXiv Detail & Related papers (2024-03-09T02:55:44Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating-point operations of off-the-shelf audio neural networks by more than half.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System
Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) algorithm, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - High Quality Streaming Speech Synthesis with Low,
Sentence-Length-Independent Latency [3.119625275101153]
The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation.
The full end-to-end system can generate almost natural-quality speech, which is verified by listening tests.
arXiv Detail & Related papers (2021-11-17T11:46:43Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy a large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - Bunched LPCNet : Vocoder for Low-cost Neural Text-To-Speech Systems [18.480490920718367]
LPCNet is an efficient vocoder that combines linear prediction and deep neural network modules to keep the computational complexity low.
We present two techniques to further reduce its complexity, aiming for a low-cost LPCNet-vocoder-based neural Text-to-Speech (TTS) system.
arXiv Detail & Related papers (2020-08-11T08:15:45Z)
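To give a rough sense of why bunching reduces complexity in the LPCNet family above: the dominant cost is the recurrent network, which runs once per emitted bunch of samples. A back-of-envelope count (hypothetical layer sizes, recurrent matrix-vector products only, ignoring embeddings and output layers) can be sketched as:

```python
def lpcnet_gflops(sr_hz, gru_a, gru_b, bunch=1):
    """Rough per-second multiply-add count (in GFLOPs) for an LPCNet-style
    vocoder with two GRUs of the given hidden sizes. Counts only the
    recurrent mat-vecs (3 gates per GRU); sizes are illustrative."""
    per_step = 3 * (gru_a * gru_a) + 3 * (gru_b * gru_b)
    steps_per_sec = sr_hz / bunch  # the network runs once per bunch
    return per_step * steps_per_sec / 1e9

# Emitting 2 samples per network step halves the recurrent cost.
base = lpcnet_gflops(16000, 384, 16, bunch=1)
bunched = lpcnet_gflops(16000, 384, 16, bunch=2)
```

The model is linear in `1/bunch`, so bunching trades a small quality loss for a proportional reduction in compute.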
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.