MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
- URL: http://arxiv.org/abs/2508.06098v2
- Date: Wed, 22 Oct 2025 09:22:42 GMT
- Title: MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
- Authors: Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen,
- Abstract summary: MeanAudio is a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). We demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation.
- Score: 13.130255838403002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed remarkable progress in Text-to-Audio Generation (TTA), providing sound creators with powerful tools to transform inspirations into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference speed, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). MeanAudio leverages: (i) the MeanFlow objective with guided velocity target that significantly accelerates inference speed, (ii) an enhanced Flux-style transformer with dual text encoders for better semantic alignment and synthesis quality, and (iii) an efficient instantaneous-to-mean curriculum that speeds up convergence and enables training on consumer-grade GPUs. Through a comprehensive evaluation study, we demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also shows strong performance in multi-step generation, enabling smooth transitions across successive synthesis steps.
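The two headline numbers in the abstract can be made concrete with a short sketch. The mean-velocity network below is a toy stand-in (the real model, and conventions such as noise sitting at t=1, are assumptions here, not the paper's exact formulation), but it shows the shape of 1-NFE sampling and how the reported real-time factor is computed:

```python
import numpy as np

# Toy stand-in for a trained MeanFlow network u(z, r, t), which predicts the
# *average* velocity over the interval [r, t]; a single call spanning the
# whole interval therefore transports noise to data in one step (1-NFE).
# The linear field below is purely illustrative.
def mean_velocity(z, r, t):
    return -z

def one_step_sample(z1):
    # One function evaluation: move from t=1 (noise) to t=0 (sample)
    # along the predicted average velocity over the full interval.
    return z1 - (1.0 - 0.0) * mean_velocity(z1, 0.0, 1.0)

def real_time_factor(wall_seconds, audio_seconds):
    # RTF = wall-clock synthesis time / duration of the audio produced.
    # RTF < 1 is faster than real time; the paper's RTF of 0.013 means
    # roughly 0.13 s of compute per 10 s of audio.
    return wall_seconds / audio_seconds

noise = np.random.default_rng(0).standard_normal(4)
sample = one_step_sample(noise)
rtf = real_time_factor(0.13, 10.0)
```

Note that a multi-step ODE solver would call the velocity network once per step; predicting the average velocity over the whole interval is what collapses this to one call.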
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z) - Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - FCPE: A Fast Context-based Pitch Estimation Model [10.788664167503676]
We propose a fast context-based pitch estimation model that captures mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with state-of-the-art methods.
arXiv Detail & Related papers (2025-09-18T16:50:09Z) - MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation [12.665130073406651]
A key challenge in synthesizing audio from silent video is the inherent trade-off between synthesis quality and inference efficiency. We introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity. We demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality.
arXiv Detail & Related papers (2025-09-08T07:15:21Z) - READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count to speed up generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z) - StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling [50.537794606598254]
StreamMel is a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. It enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. It even achieves performance comparable to offline systems while supporting efficient real-time generation.
arXiv Detail & Related papers (2025-06-14T16:53:39Z) - AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion [23.250409921931492]
Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. This approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. We propose AudioTurbo, which learns first-order ODE paths from deterministic noise-sample pairs generated by a pre-trained TTA model.
arXiv Detail & Related papers (2025-05-28T08:33:58Z) - Fast Text-to-Audio Generation with Adversarial Post-Training [39.000388217500785]
Text-to-audio systems are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation.
arXiv Detail & Related papers (2025-05-13T02:25:47Z) - FlowTS: Time Series Generation via Rectified Flow [67.41208519939626]
FlowTS is an ODE-based model that leverages rectified flow with straight-line transport in probability space. In the unconditional setting, FlowTS achieves state-of-the-art performance, with context FID scores of 0.019 and 0.011 on the Stock and ETTh datasets. In the conditional setting, it achieves superior performance in solar forecasting.
arXiv Detail & Related papers (2024-11-12T03:03:23Z) - FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner [70.90505084288057]
Flow-based models tend to produce a straighter sampling trajectory during the sampling process.
We introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time.
FlowTurbo reaches an FID of 2.12 on ImageNet at 100 ms/img and an FID of 3.93 at 38 ms/img.
arXiv Detail & Related papers (2024-09-26T17:59:51Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improve sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, achieving a speedup compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z) - VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching [14.7974342537458]
VoiceFlow is an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps.
Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart.
arXiv Detail & Related papers (2023-09-10T13:47:39Z) - FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z) - Neural Waveshaping Synthesis [0.0]
We present a novel, lightweight, fully causal approach to neural audio synthesis.
The Neural Waveshaping Unit (NEWT) operates directly in the waveform domain.
It produces complex timbral evolutions by simple affine transformations of its input and output signals.
arXiv Detail & Related papers (2021-07-11T13:50:59Z)
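Several of the related papers above (AudioTurbo, FlowTS, Frieren, VoiceFlow) build on rectified flow. A minimal sketch of its straight-line training target follows; treating `x0` as the noise sample and `x1` as the data sample, with a scalar time `t`, is an illustrative convention rather than any one paper's exact recipe:

```python
import numpy as np

# Rectified flow supervises a velocity network on straight-line paths:
# the interpolant z_t = (1 - t) * x0 + t * x1 between a noise sample x0
# and a data sample x1 has the constant velocity (x1 - x0) at every t,
# which is what makes few-step (even one-step) ODE sampling viable.
def rectified_flow_target(x0, x1, t):
    z_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    velocity = x1 - x0              # regression target for the network
    return z_t, velocity

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)         # noise sample
x1 = rng.standard_normal(8)         # data sample (e.g. an audio latent)
z, v = rectified_flow_target(x0, x1, 0.5)
```

During training, `t` is sampled per example and the network is regressed onto `velocity` at input `z_t`; at inference, integrating the learned ODE along a straight path needs very few solver steps.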
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.