PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders
- URL: http://arxiv.org/abs/2404.02702v3
- Date: Thu, 21 Nov 2024 10:31:03 GMT
- Title: PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders
- Authors: Yu Pan, Xiang Zhang, Yuguang Yang, Jixun Yao, Yanni Hu, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao,
- Abstract summary: PSCodec is a series of neural speech codecs based on prompt encoders.
PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN are capable of delivering high-performance speech reconstruction with low bandwidths.
- Score: 9.998721582869438
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Neural speech codecs have recently emerged as a focal point in the fields of speech compression and generation. Despite this progress, achieving high-quality speech reconstruction under low-bitrate scenarios remains a significant challenge. In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths. Specifically, we first introduce PSCodec-Base, which leverages a pretrained speaker verification model-based prompt encoder (VPP-Enc) and a learnable Mel-spectrogram-based prompt encoder (MelP-Enc) to effectively disentangle and integrate voiceprint and Mel-related features in utterances. To further enhance feature utilization efficiency, we propose PSCodec-DRL-ICT, incorporating a structural similarity (SSIM) based disentangled representation loss (DRL) and an incremental continuous training (ICT) strategy. While PSCodec-DRL-ICT demonstrates impressive performance, its reliance on extensive hyperparameter tuning and multi-stage training makes it somewhat labor-intensive. To circumvent these limitations, we propose PSCodec-CasAN, utilizing an advanced cascaded attention network (CasAN) to enhance representational capacity of the entire system. Extensive experiments show that our proposed PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN all significantly outperform several state-of-the-art neural codecs, exhibiting substantial improvements in both speech reconstruction quality and speaker similarity under low-bitrate conditions.
Related papers
- Baseline Systems For The 2025 Low-Resource Audio Codec Challenge [0.0]
The Low-Resource Audio Codec (LRAC) Challenge aims to advance neural audio coding for deployment in resource-constrained environments.<n>This paper presents the official baseline systems for both tracks in the 2025 LRAC Challenge.
arXiv Detail & Related papers (2025-09-30T20:36:58Z) - Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates [1.445167946386569]
We show that Finite Scalar Quantization (FSQ) encodes baked-in redundancy which produces an encoding which is robust when transmitted through noisy channels.<n>We demonstrate that FSQ has vastly superior bit-level perturbation by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
arXiv Detail & Related papers (2025-09-11T15:39:59Z) - SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models.<n>Existing methods face several challenges in semantic encoding.<n>We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codecs.
arXiv Detail & Related papers (2025-08-04T19:22:14Z) - HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codecs that achieves extreme compression at 24 tokens per second for 24 kHz audio.<n>Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss.<n> HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z) - Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations [26.938560887095658]
Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss.<n>We propose QTTS, a novel TTS framework built upon our new audio, QDAC.<n>Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baseline.
arXiv Detail & Related papers (2025-07-16T12:47:09Z) - Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding.<n>Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z) - Plug-and-Play Versatile Compressed Video Enhancement [57.62582951699999]
Video compression effectively reduces the size of files, making it possible for real-time cloud computing.
However, it comes at the cost of visual quality, challenges the robustness of downstream vision models.
We present a versatile-aware enhancement framework that adaptively enhance videos under different compression settings.
arXiv Detail & Related papers (2025-04-21T18:39:31Z) - FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate based on focal modulation that utilizes a single binary codebook to compress speech.
Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z) - $ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality (
arXiv Detail & Related papers (2024-10-05T08:27:53Z) - Compression-Realized Deep Structural Network for Video Quality Enhancement [78.13020206633524]
This paper focuses on the task of quality enhancement for compressed videos.
Most of the existing methods lack a structured design to optimally leverage the priors within compression codecs.
A new paradigm is urgently needed for a more conscious'' process of quality enhancement.
arXiv Detail & Related papers (2024-05-10T09:18:17Z) - RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel representation for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z) - Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient
Neural Image Compression [11.25130799452367]
We propose an absolute image compression transformer (ICT) for neural image compression (NIC)
ICT captures both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents.
Our framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural SwinT-ChARM.
arXiv Detail & Related papers (2023-07-05T13:17:14Z) - VNVC: A Versatile Neural Video Coding Framework for Efficient
Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z) - Joint Encoder-Decoder Self-Supervised Pre-training for ASR [0.0]
Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
arXiv Detail & Related papers (2022-06-09T12:45:29Z) - Neural JPEG: End-to-End Image Compression Leveraging a Standard JPEG
Encoder-Decoder [73.48927855855219]
We propose a system that learns to improve the encoding performance by enhancing its internal neural representations on both the encoder and decoder ends.
Experiments demonstrate that our approach successfully improves the rate-distortion performance over JPEG across various quality metrics.
arXiv Detail & Related papers (2022-01-27T20:20:03Z) - Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs)
arXiv Detail & Related papers (2021-12-21T19:14:44Z) - Dynamic Neural Representational Decoders for High-Resolution Semantic
Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD)
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z) - Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to the other autoregressive decoder-based neural speech, our decoder has significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.