Triple M: A Practical Neural Text-to-Speech System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet
- URL: http://arxiv.org/abs/2102.00247v1
- Date: Sat, 30 Jan 2021 15:38:36 GMT
- Title: Triple M: A Practical Neural Text-to-Speech System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet
- Authors: Shilun Lin, Xinhui Li, Li Lu
- Abstract summary: We propose a practical neural text-to-speech system, named Triple M, consisting of a seq2seq model with multi-guidance attention and a multi-band multi-time LPCNet.
The former uses alignment results of different attention mechanisms to guide the learning of the basic attention mechanism, and only retains the basic attention mechanism during inference.
The latter reduces the computational complexity of LPCNet through combining multi-band and multi-time strategies.
- Score: 4.552464397842643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although the sequence-to-sequence network with attention mechanism and neural
vocoder has made great progress in the quality of speech synthesis, there are
still some problems to be solved in large-scale real-time applications: for
example, avoiding alignment failures on long sentences while maintaining rich
prosody, and reducing the computational overhead while preserving perceptual
quality. To address these issues, we propose a practical neural
text-to-speech system, named Triple M, consisting of a seq2seq model with
multi-guidance attention and a multi-band multi-time LPCNet. The former uses
alignment results of different attention mechanisms to guide the learning of
the basic attention mechanism, and only retains the basic attention mechanism
during inference. This approach can improve the performance of the
text-to-feature module by absorbing the advantages of all guidance attention
methods without modifying the basic inference architecture. The latter reduces
the computational complexity of LPCNet through combining multi-band and
multi-time strategies. The multi-band strategy enables the LPCNet to generate
sub-band signals in each inference. By predicting the sub-band signals of
adjacent time in one forward operation, the multi-time strategy further
decreases the number of inferences required. Thanks to the combined multi-band
and multi-time strategies, the vocoder runs 2.75x faster on a single CPU with
only a slight degradation in MOS (mean opinion score).
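The inference-count arithmetic behind the multi-band multi-time strategy can be sketched as follows. This is an illustrative sketch, not the authors' code: the function and parameter names (`forward_passes`, `num_bands`, `steps_per_pass`) and the 4-band / 2-step configuration are assumptions for demonstration only.

```python
def forward_passes(num_samples: int, num_bands: int = 1, steps_per_pass: int = 1) -> int:
    """Number of vocoder forward passes needed to generate num_samples.

    A plain LPCNet runs one forward pass per output sample. With a
    multi-band split, each pass emits one sample per sub-band; with the
    multi-time strategy, each pass additionally predicts steps_per_pass
    adjacent time steps, so a single pass covers
    num_bands * steps_per_pass output samples.
    """
    samples_per_pass = num_bands * steps_per_pass
    # Ceiling division, so a final partial pass is still counted.
    return -(-num_samples // samples_per_pass)

# One second of 16 kHz audio in a toy configuration:
baseline = forward_passes(16000)                                 # 16000 passes
combined = forward_passes(16000, num_bands=4, steps_per_pass=2)  # 2000 passes
print(baseline / combined)  # 8.0
```

Note that the reported 2.75x wall-clock speedup is smaller than the raw reduction in pass count would suggest, since each multi-band multi-time pass is more expensive than a single-sample pass.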
Related papers
- Multi-task Photonic Reservoir Computing: Wavelength Division Multiplexing for Parallel Computing with a Silicon Microring Resonator [0.0]
We numerically show the use of time and wavelength division multiplexing (WDM) to solve four independent tasks at the same time in a single photonic chip.
The footprint of the system is reduced by using time-division multiplexing of the nodes that act as the neurons of the studied neural network scheme.
arXiv Detail & Related papers (2024-07-30T20:54:07Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Deep Reinforcement Learning for Uplink Scheduling in NOMA-URLLC Networks [7.182684187774442]
This article addresses the problem of Ultra Reliable Low Latency Communications (URLLC) in wireless networks, a framework with particularly stringent constraints imposed by many Internet of Things (IoT) applications from diverse sectors.
We propose a novel Deep Reinforcement Learning (DRL) scheduling algorithm to solve the Non-Orthogonal Multiple Access (NOMA) uplink URLLC scheduling problem involving strict deadlines.
arXiv Detail & Related papers (2023-08-28T12:18:02Z) - Multi-Loss Convolutional Network with Time-Frequency Attention for
Speech Enhancement [16.701596804113553]
We explore self-attention in the DPCRN module and design a model called Multi-Loss Convolutional Network with Time-Frequency Attention (MNTFA) for speech enhancement.
Compared to DPRNN, axial self-attention greatly reduces the need for memory and computation.
We propose a joint training method of a multi-resolution STFT loss and a WavLM loss using a pre-trained WavLM network.
arXiv Detail & Related papers (2023-06-15T08:48:19Z) - TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch network architectures.
We propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently in a speaker embedding network with almost no increase in computational cost.
arXiv Detail & Related papers (2022-03-17T05:49:35Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) algorithm, Soft Actor-Critic for discrete variables (SAC-d), which generates the *exit point* and *compressing bits* by soft policy iterations.
Based on a latency- and accuracy-aware reward design, the computation adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z) - ZoPE: A Fast Optimizer for ReLU Networks with Low-Dimensional Inputs [30.34898838361206]
We present an algorithm called ZoPE that solves optimization problems over the output of feedforward ReLU networks with low-dimensional inputs.
Using ZoPE, we observe a 25x speedup on property 1 of the ACAS Xu neural network verification benchmark and an 85x speedup on a set of linear optimization problems.
arXiv Detail & Related papers (2021-06-09T18:36:41Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale data with a deep neural network model.
Our algorithm requires far fewer communication rounds in theory.
Our experiments on several datasets demonstrate the effectiveness of the proposed method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.