Related papers: UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

URL: http://arxiv.org/abs/2506.09874v2
Date: Thu, 10 Jul 2025 19:47:47 GMT
Title: UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching
Authors: Neta Glazer, Aviv Navon, Yael Segal, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet,
Abstract summary: We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio.<n>Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes.
Score: 17.559310386487493
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data with speech and background audio aligned in natural context. To overcome the lack of paired training data, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperformed existing baselines, producing natural, high-quality, environmentally aware audios.

Related papers

LoRP-TTS: Low-Rank Personalized Text-To-Speech [0.0]
Speech synthesis models convert written text into natural-sounding audio.<n>Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts.
arXiv Detail & Related papers (2025-02-11T14:00:12Z)
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance [9.87139502863569]
Koel-TTS is a suite of enhanced encoder-decoder Transformer TTS models.<n>We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models.
arXiv Detail & Related papers (2025-02-07T06:47:11Z)
On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised. We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z)
Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z)
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech. We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication. We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model. It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech. It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. Experiments show A$3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker. We generate the mel-spectrogram of the edited speech with a transformer-based decoder. It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.