Related papers: Fast Text-to-Audio Generation with Adversarial Post-Training

URL: http://arxiv.org/abs/2505.08175v3
Date: Tue, 20 May 2025 02:54:49 GMT
Title: Fast Text-to-Audio Generation with Adversarial Post-Training
Authors: Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
Abstract summary: Text-to-audio systems are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation.
Score: 39.000388217500785
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating $\approx$12s of 44.1kHz stereo audio in $\approx$75ms on an H100, and $\approx$7s on a mobile edge-device, the fastest text-to-audio model to our knowledge.

Related papers

Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
Whisfusion: Parallel ASR Decoding via a Diffusion Transformer [7.327454599174306]
Whisfusion is a framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny, and offers comparable latency on short audio.
arXiv Detail & Related papers (2025-08-09T17:20:54Z)
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows [13.130255838403002]
MeanAudio is a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). We demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation.
arXiv Detail & Related papers (2025-08-08T07:49:59Z)
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count to speech generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion [23.250409921931492]
Rectified flow enhances inference speed by learning straight-line ordinary differential equation paths. This approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. We propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model.
arXiv Detail & Related papers (2025-05-28T08:33:58Z)
Sample-Efficient Diffusion for Text-To-Speech Synthesis [31.372486998377966]
It is based on a novel diffusion architecture that we call U-Audio Transformer (U-AT). SESD achieves impressive results despite training on less than 1k hours of speech. It synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.
arXiv Detail & Related papers (2024-09-01T20:34:36Z)
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation [43.61383132919089]
Controllable music generation methods are critical for human-centered AI-based music creation. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control.
arXiv Detail & Related papers (2024-05-30T17:40:11Z)
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer [2.443213094810588]
Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in image modality and Audio-MAE in audio modality. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks.
arXiv Detail & Related papers (2024-01-07T14:31:27Z)
Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis [123.11530365315677]
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production.
arXiv Detail & Related papers (2023-08-31T15:41:40Z)
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches. Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems [17.160006765475988]
We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) model. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost.
arXiv Detail & Related papers (2022-11-01T23:43:15Z)
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.