Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning
- URL: http://arxiv.org/abs/2008.03096v1
- Date: Fri, 7 Aug 2020 11:48:05 GMT
- Title: Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning
- Authors: Devang S Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh,
Marlene Staib, Alexandra Torresquintero, Jiameng Gao
- Abstract summary: Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent to decide when to read the next character and when to synthesise audio.
- Score: 60.20205278845412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern approaches to text to speech require the entire input character
sequence to be processed before any audio is synthesised. This latency limits
the suitability of such models for time-sensitive tasks like simultaneous
interpretation. Interleaving the action of reading a character with that of
synthesising audio reduces this latency. However, the order of this sequence of
interleaved actions varies across sentences, which raises the question of how
the actions should be chosen. We propose a reinforcement learning based
framework to train an agent to make this decision. We compare our performance
against that of deterministic, rule-based systems. Our results demonstrate that
our agent successfully balances the trade-off between the latency of audio
generation and the quality of synthesised audio. More broadly, we show that
neural sequence-to-sequence models can be adapted to run in an incremental
manner.
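To make the interleaved decision process concrete, below is a minimal toy sketch of a READ/SPEAK loop of the kind the abstract describes. It is not the authors' implementation: the environment, the wait-k style deterministic baseline, and the reward weights are all illustrative assumptions, and the synthesiser and duration model are replaced by trivial stand-ins.
```python
# Hypothetical sketch of the interleaved READ/SPEAK decision loop.
# All names and reward weights are illustrative assumptions; the speech
# synthesiser is stubbed out with a fixed frames-per-character stand-in.

from dataclasses import dataclass, field
from typing import List

READ, SPEAK = 0, 1  # the two interleaved actions


@dataclass
class IncrementalTTSEnv:
    """Toy environment: the agent alternates between reading characters
    and emitting audio frames for the text read so far."""
    text: str
    frames_per_char: int = 2          # crude stand-in for a duration model
    read_ptr: int = 0                 # characters consumed so far
    spoken_frames: int = 0            # audio frames emitted so far
    latency_log: List[int] = field(default_factory=list)

    def done(self) -> bool:
        target_frames = len(self.text) * self.frames_per_char
        return self.read_ptr == len(self.text) and self.spoken_frames >= target_frames

    def step(self, action: int) -> float:
        """Apply READ or SPEAK and return a shaped reward that trades off
        latency (a cost for waiting on more input) against quality (a cost
        for speaking with too little context). Weights are arbitrary."""
        if action == READ and self.read_ptr < len(self.text):
            self.read_ptr += 1
            reward = -0.1  # small latency cost for each character we wait on
        elif action == SPEAK:
            # Only frames covered by already-read characters count as "in context".
            covered = self.read_ptr * self.frames_per_char
            in_context = self.spoken_frames < covered
            self.spoken_frames += 1
            reward = 1.0 if in_context else -1.0  # quality proxy
        else:
            reward = -1.0  # invalid action (e.g. READ past the end of the input)
        self.latency_log.append(self.read_ptr)
        return reward


class WaitKPolicy:
    """Deterministic rule-based baseline: read k characters ahead, then keep
    speaking until the emitted frames catch up with the characters read."""
    def __init__(self, k: int = 3):
        self.k = k

    def act(self, env: IncrementalTTSEnv) -> int:
        if env.read_ptr < min(self.k, len(env.text)):
            return READ
        covered = env.read_ptr * env.frames_per_char
        if env.spoken_frames < covered - env.frames_per_char or env.read_ptr == len(env.text):
            return SPEAK
        return READ


if __name__ == "__main__":
    env = IncrementalTTSEnv(text="hello world")
    policy = WaitKPolicy(k=3)
    total_reward = 0.0
    while not env.done():
        total_reward += env.step(policy.act(env))
    print(f"episode reward: {total_reward:.1f}, frames emitted: {env.spoken_frames}")
```
In the paper's framing, the learned agent replaces the hand-crafted policy above: it is trained with reinforcement learning on a reward of this latency-versus-quality shape, rather than following a fixed read-ahead schedule.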
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z) - Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z) - Using previous acoustic context to improve Text-to-Speech synthesis [30.885417054452905]
We leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio.
We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio.
arXiv Detail & Related papers (2020-12-07T15:00:18Z) - Replacing Human Audio with Synthetic Audio for On-device Unspoken
Punctuation Prediction [10.516452073178511]
We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features.
We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem.
arXiv Detail & Related papers (2020-10-20T11:30:26Z) - End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.