Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding
- URL: http://arxiv.org/abs/2509.02244v1
- Date: Tue, 02 Sep 2025 12:14:41 GMT
- Title: Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding
- Authors: Luis Felipe Chary, Miguel Arjona Ramirez
- Abstract summary: We present a neural speech codec that challenges the need for complex residual vector quantization stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a neural speech codec that challenges the need for complex residual vector quantization (RVQ) stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid. To ensure high-fidelity synthesis, we employ late-stage adversarial fine-tuning for the VQ-VAE and train a HiFi-GAN vocoder from scratch on the codec's reconstructed spectrograms. Operating at approximately 7.5 kbit/s for 16 kHz speech, our system was evaluated against several state-of-the-art neural codecs using objective metrics such as STOI, PESQ, MCD, and ViSQOL. The results demonstrate that our simplified, non-residual architecture achieves competitive perceptual quality and intelligibility, validating it as an effective and open foundation for future low-latency codec designs.
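The core idea from the abstract, quantizing non-overlapping 4x4 mel-spectrogram patches against a single shared codebook, can be sketched as below. This is a minimal illustration, not the paper's implementation: the spectrogram shape, codebook size, and random initialization are all assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_mels, n_frames, patch = 80, 64, 4       # illustrative shapes, divisible by the patch size
codebook_size = 1024                       # assumed codebook size, not the paper's value

mel = rng.standard_normal((n_mels, n_frames))
codebook = rng.standard_normal((codebook_size, patch * patch))

# Split the spectrogram into a (20 x 16) grid of non-overlapping 4x4
# patches, each flattened to a 16-dimensional vector.
patches = (mel.reshape(n_mels // patch, patch, n_frames // patch, patch)
              .transpose(0, 2, 1, 3)
              .reshape(-1, patch * patch))

# Nearest-neighbour lookup in the single shared codebook
# (squared Euclidean distance), yielding one index per patch.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = dists.argmin(axis=1)             # the discrete latent grid

quantized = codebook[indices]              # what a decoder would consume
print(indices.shape, quantized.shape)
```

In training, the rounding step would be bridged with a straight-through estimator as in a standard VQ-VAE; here only the inference-time lookup is shown.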
Related papers
- Learning to Decode in Parallel: Self-Coordinating Neural Network for Real-Time Quantum Error Correction [16.310410074065743]
We present the first scalable, neural-network-based parallel decoding framework that simultaneously achieves SOTA accuracy and the stringent throughput required for real-time quantum error correction. We demonstrate that, using our approach, a single TPU v6e is capable of decoding surface codes with distances up to 25 within 1 us per decoding round.
arXiv Detail & Related papers (2026-01-14T23:04:25Z) - VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design incurs negligible semantic information loss, maintaining multimodal understanding ability while emitting discrete tokens. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation [71.59514998928833]
U-Codec achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5 Hz. We apply U-Codec to a large language model (LLM)-based auto-regressive TTS model.
arXiv Detail & Related papers (2025-10-19T05:09:20Z) - Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) bakes redundancy into its encoding, producing codes that remain robust when transmitted through noisy channels. By comparing RVQ and FSQ codecs under simulated transmission of code sequences through a noisy channel, we demonstrate that FSQ is vastly more robust to bit-level perturbations.
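The contrast with codebook-based VQ can be made concrete with a minimal FSQ sketch: each latent dimension is independently bounded and rounded to a small fixed set of levels, so a perturbed bit shifts one scalar slightly rather than redirecting an entire codebook lookup. The per-dimension level counts below are illustrative assumptions, not values from the paper.

```python
import numpy as np

levels = np.array([8, 5, 5, 5])           # assumed per-dimension level counts

def fsq_quantize(z):
    """Bound each dimension to (-1, 1) with tanh, then round to the nearest level."""
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half           # each dim now lies in (-half, half)
    return np.round(bounded) / half       # discrete values on a fixed grid in [-1, 1]

z = np.array([0.3, -1.2, 0.05, 2.0])
print(fsq_quantize(z))
```

The product of the level counts (here 8 * 5 * 5 * 5 = 1000) plays the role of an implicit codebook size, but no learned codebook or nearest-neighbour search is needed.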
arXiv Detail & Related papers (2025-09-11T15:39:59Z) - HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio. Our approach involves a carefully designed vector quantization space for spoken language modeling, optimizing compression efficiency while minimizing information loss. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z) - Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding [56.066799081747845]
The ever-growing size of neural networks poses serious challenges on resource-constrained devices. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding. Our method allows for very fast decoding and is compatible with arbitrary quantization grids.
arXiv Detail & Related papers (2025-05-24T15:52:49Z) - L3AC: Towards a Lightweight and Lossless Audio Codec [10.903708510237875]
We introduce L3AC, a lightweight neural audio codec that addresses these challenges by leveraging a single quantizer and a highly efficient architecture. L3AC explores streamlined convolutional networks and local Transformer modules, alongside TConv, a novel structure designed to capture acoustic variations across multiple temporal scales.
arXiv Detail & Related papers (2025-04-07T11:34:39Z) - Variable Bitrate Residual Vector Quantization for Audio Coding [29.368893236587343]
Recent neural audio compression models have progressively adopted residual vector quantization (RVQ). These models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoffs. We propose variable RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame.
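The fixed-depth RVQ baseline that VRVQ makes variable-rate can be sketched as follows: each stage quantizes the residual left by the previous stage, so truncating to fewer codebooks per frame trades reconstruction quality for bitrate. All sizes and the random codebooks are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_codes, n_stages = 16, 256, 4        # assumed dimensions, not from the paper
codebooks = rng.standard_normal((n_stages, n_codes, dim))

def rvq_encode(x, n_active):
    """Quantize x with the first n_active codebooks; return indices and reconstruction."""
    recon, idx = np.zeros_like(x), []
    for cb in codebooks[:n_active]:
        residual = x - recon               # what the previous stages failed to capture
        i = int(((residual[None, :] - cb) ** 2).sum(-1).argmin())
        idx.append(i)
        recon = recon + cb[i]              # each stage refines the running estimate
    return idx, recon

x = rng.standard_normal(dim)
for k in (1, 4):                           # fewer active stages -> lower bitrate
    _, recon = rvq_encode(x, k)
    print(k, "stages, reconstruction error:", np.linalg.norm(x - recon))
```

VRVQ's contribution is choosing `n_active` per frame at inference time instead of fixing it globally; a trained codec would also learn the codebooks rather than sampling them randomly.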
arXiv Detail & Related papers (2024-10-08T13:18:24Z) - Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos.
Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs.
A new paradigm is urgently needed for a more "conscious" process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z) - Variational Autoencoders: A Harmonic Perspective [79.49579654743341]
We study Variational Autoencoders (VAEs) from the perspective of harmonic analysis.
We show that the encoder variance of a VAE controls the frequency content of the functions parameterised by the VAE encoder and decoder neural networks.
arXiv Detail & Related papers (2021-05-31T10:39:25Z) - Parallelized Rate-Distortion Optimized Quantization Using Deep Learning [9.886383889250064]
Rate-Distortion Optimized Quantization (RDOQ) has played an important role in the coding performance of recent video compression standards such as H.264/AVC, H.265/HEVC, VP9 and AV1.
This work addresses this limitation using a neural network-based approach, which learns to trade-off rate and distortion during offline supervised training.
arXiv Detail & Related papers (2020-12-11T14:28:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.