Related papers: SpeechOp: Inference-Time Task Composition for Generative Speech Processing

SpeechOp: Inference-Time Task Composition for Generative Speech Processing

URL: http://arxiv.org/abs/2509.14298v1
Date: Wed, 17 Sep 2025 05:05:55 GMT
Title: SpeechOp: Inference-Time Task Composition for Generative Speech Processing
Authors: Justin Lovelace, Rithesh Kumar, Jiaqi Su, Ke Chen, Kilian Q Weinberger, Zeyu Jin,
Abstract summary: SpeechOp is a universal speech processor capable of performing a wide range of speech tasks.<n>Implicit Task Composition helps SpeechOp's enhancement via our principled inference-time task composition.
Score: 41.5053493629172
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While generative Text-to-Speech (TTS) systems leverage vast ``in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities. Audio samples are available at https://justinlovelace.github.io/projects/speechop

Related papers

Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis [57.5830191022097]
A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation.<n>We adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs.<n>Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines.
arXiv Detail & Related papers (2025-11-07T17:07:56Z)
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance.<n>Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching [3.05024318465243]
OZSpeech is the first TTS method to explore optimal transport conditional flow matching with one-step sampling.<n>Our approach operates on disentangled, factorized components of speech in token format, enabling accurate modeling of each speech attribute.<n> Experimental results show that our method achieves promising performance over existing methods in content accuracy, naturalness, prosody generation, and speaker style preservation.
arXiv Detail & Related papers (2025-05-19T07:31:55Z)
GRASS: Unified Generation Model for Speech-to-Semantic Tasks [7.044414457214718]
We introduce a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. Our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
arXiv Detail & Related papers (2023-09-06T06:44:26Z)
Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space. The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
Guided-TTS:Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data. For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.