Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning
- URL: http://arxiv.org/abs/2008.03096v1
- Date: Fri, 7 Aug 2020 11:48:05 GMT
- Title: Incremental Text to Speech for Neural Sequence-to-Sequence Models using
Reinforcement Learning
- Authors: Devang S Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh,
Marlene Staib, Alexandra Torresquintero, Jiameng Gao
- Abstract summary: Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent to decide when to read the next character and when to synthesise audio.
- Score: 60.20205278845412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern approaches to text to speech require the entire input character
sequence to be processed before any audio is synthesised. This latency limits
the suitability of such models for time-sensitive tasks like simultaneous
interpretation. Interleaving the action of reading a character with that of
synthesising audio reduces this latency. However, the order of this sequence of
interleaved actions varies across sentences, which raises the question of how
the actions should be chosen. We propose a reinforcement learning based
framework to train an agent to make this decision. We compare our performance
against that of deterministic, rule-based systems. Our results demonstrate that
our agent successfully balances the trade-off between the latency of audio
generation and the quality of synthesised audio. More broadly, we show that
neural sequence-to-sequence models can be adapted to run in an incremental
manner.
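To make the interleaved decision process concrete, below is a minimal toy sketch of a READ/SPEAK loop of the kind the abstract describes. It is not the authors' implementation: the environment, the wait-k style deterministic baseline, and the reward weights are all illustrative assumptions, and the synthesiser and duration model are replaced by trivial stand-ins.
```python
# Hypothetical sketch of the interleaved READ/SPEAK decision loop.
# All names and reward weights are illustrative assumptions; the speech
# synthesiser is stubbed out with a fixed frames-per-character stand-in.

from dataclasses import dataclass, field
from typing import List

READ, SPEAK = 0, 1  # the two interleaved actions


@dataclass
class IncrementalTTSEnv:
    """Toy environment: the agent alternates between reading characters
    and emitting audio frames for the text read so far."""
    text: str
    frames_per_char: int = 2          # crude stand-in for a duration model
    read_ptr: int = 0                 # characters consumed so far
    spoken_frames: int = 0            # audio frames emitted so far
    latency_log: List[int] = field(default_factory=list)

    def done(self) -> bool:
        target_frames = len(self.text) * self.frames_per_char
        return self.read_ptr == len(self.text) and self.spoken_frames >= target_frames

    def step(self, action: int) -> float:
        """Apply READ or SPEAK and return a shaped reward that trades off
        latency (a cost for waiting on more input) against quality (a cost
        for speaking with too little context). Weights are arbitrary."""
        if action == READ and self.read_ptr < len(self.text):
            self.read_ptr += 1
            reward = -0.1  # small latency cost for each character we wait on
        elif action == SPEAK:
            # Only frames covered by already-read characters count as "in context".
            covered = self.read_ptr * self.frames_per_char
            in_context = self.spoken_frames < covered
            self.spoken_frames += 1
            reward = 1.0 if in_context else -1.0  # quality proxy
        else:
            reward = -1.0  # invalid action (e.g. READ past the end of the input)
        self.latency_log.append(self.read_ptr)
        return reward


class WaitKPolicy:
    """Deterministic rule-based baseline: read k characters ahead, then keep
    speaking until the emitted frames catch up with the characters read."""
    def __init__(self, k: int = 3):
        self.k = k

    def act(self, env: IncrementalTTSEnv) -> int:
        if env.read_ptr < min(self.k, len(env.text)):
            return READ
        covered = env.read_ptr * env.frames_per_char
        if env.spoken_frames < covered - env.frames_per_char or env.read_ptr == len(env.text):
            return SPEAK
        return READ


if __name__ == "__main__":
    env = IncrementalTTSEnv(text="hello world")
    policy = WaitKPolicy(k=3)
    total_reward = 0.0
    while not env.done():
        total_reward += env.step(policy.act(env))
    print(f"episode reward: {total_reward:.1f}, frames emitted: {env.spoken_frames}")
```
In the paper's framing, the learned agent replaces the hand-crafted policy above: it is trained with reinforcement learning on a reward of this latency-versus-quality shape, rather than following a fixed read-ahead schedule.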
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z) - Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z) - Using previous acoustic context to improve Text-to-Speech synthesis [30.885417054452905]
We leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio.
We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio.
arXiv Detail & Related papers (2020-12-07T15:00:18Z) - Replacing Human Audio with Synthetic Audio for On-device Unspoken
Punctuation Prediction [10.516452073178511]
We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features.
We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem.
arXiv Detail & Related papers (2020-10-20T11:30:26Z) - End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.