Related papers: PitchFlower: A flow-based neural audio codec with pitch controllability

URL: http://arxiv.org/abs/2510.25566v1
Date: Wed, 29 Oct 2025 14:33:35 GMT
Title: PitchFlower: A flow-based neural audio codec with pitch controllability
Authors: Diego Torres, Axel Roebel, Nicolas Obin
Abstract summary: We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio.
Score: 8.972144370022841
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
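The perturbation described in the abstract (flatten the F0 contour, then shift it by a random amount) can be illustrated on a bare F0 track. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `flatten_and_shift_f0`, the convention that unvoiced frames are marked with 0, the use of the geometric mean as the flat value, and the ±12 semitone shift range are all hypothetical choices made for illustration. The abstract does not specify any of them, nor how the perturbation is applied to the audio itself.

```python
import math
import random


def flatten_and_shift_f0(f0_hz, max_shift_semitones=12.0, rng=None):
    """Illustrative F0 perturbation: flatten voiced frames, then shift randomly.

    Assumptions (not taken from the paper): unvoiced frames are encoded
    as 0, the flat value is the geometric mean of the voiced frames, and
    the shift is drawn uniformly in semitones.
    """
    rng = rng or random.Random()
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        # Nothing to perturb in a fully unvoiced segment.
        return [0.0] * len(f0_hz)

    # Geometric mean, i.e. the arithmetic mean in log-frequency.
    flat = math.exp(sum(math.log(f) for f in voiced) / len(voiced))

    # One random shift per segment, converted from semitones to a ratio.
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    target = flat * 2.0 ** (shift / 12.0)

    # Voiced frames all receive the same value; unvoiced frames stay 0.
    return [target if f > 0 else 0.0 for f in f0_hz]


if __name__ == "__main__":
    contour = [0.0, 110.0, 123.5, 130.8, 0.0]
    print(flatten_and_shift_f0(contour, rng=random.Random(0)))
```

In the setup the abstract describes, the true (unperturbed) F0 would be supplied separately as conditioning, so a model trained on such perturbed inputs cannot rely on the input for pitch information.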

Related papers

CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space [13.286622421661313]
Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur higher computational cost and have limited high-frequency fidelity. We present CodecFlow, a neural-based BWE framework that performs efficient speech reconstruction in a compact latent space.
arXiv Detail & Related papers (2026-03-02T16:03:46Z)
EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding [18.199202388702144]
Most frequency-domain neural codecs disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails the need to introduce adversarial discriminators at the expense of convergence speed and training stability. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline.
arXiv Detail & Related papers (2026-01-24T16:34:07Z)
GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks [0.0]
Quantized neural networks can be viewed as a chain of noisy channels, where rounding in each layer reduces capacity as bit-width shrinks. We track capacity dynamics as the average bit-width decreases and identify resulting quantization bottlenecks by casting fine-tuning as a smooth, constrained optimization problem. Our approach employs a fully differentiable Straight-Through Estimator (STE) with learnable bit-width bounds, noise scale and clamp, and enforces a target bit-width via an exterior-point penalty.
arXiv Detail & Related papers (2025-08-19T17:05:26Z)
BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models [62.38713281234756]
A binaural rendering pipeline aims to synthesize audio that mimics natural hearing from mono audio. Many methods have been proposed to solve this problem, but they struggle with rendering quality and streamable inference. We propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow.
arXiv Detail & Related papers (2025-05-28T20:59:15Z)
FlowDec: A flow-based full-band general audio codec with high perceptual quality [90.05968801459524]
FlowDec is a neural full-band audio codec for general audio sampled at 48 kHz. We generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s.
arXiv Detail & Related papers (2025-03-03T12:49:09Z)
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations. These models are prone to generate audible artifacts when the conditioning is flawed or imperfect. We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch [3.858078488714278]
We propose two algorithms to improve the robustness of FastPitch. First, we propose a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation. The experimental results demonstrate that the proposed algorithms improve the pitch controllability of FastPitch.
arXiv Detail & Related papers (2022-04-12T12:48:06Z)
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for Speech Synthesis [2.4975981795360847]
Non-autoregressive neural vocoders such as WaveGlow are far behind autoregressive neural vocoders like WaveFlow in terms of modeling audio signals. NanoFlow is a state-of-the-art autoregressive neural vocoder that has immensely small parameters. We propose FlowVocoder, which has a small memory footprint and is able to generate high-fidelity audio in real-time.
arXiv Detail & Related papers (2021-09-27T06:52:55Z)
Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder [29.63675159839434]
Flow-based neural vocoders have shown significant improvement in real-time speech generation tasks. We propose audio dequantization methods for flow-based neural vocoders for high fidelity audio generation.
arXiv Detail & Related papers (2020-08-16T09:37:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.