Related papers: Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

URL: http://arxiv.org/abs/2207.02262v1
Date: Tue, 5 Jul 2022 18:52:11 GMT
Title: Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
Authors: Ali Siahkoohi and Michael Chinen and Tom Denton and W. Bastiaan Kleijn and Jan Skoglund
Abstract summary: Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. We use pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias.
Score: 28.400364949575103
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable or better than that of conventional codecs operating at three to four times the rate.

Related papers

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models.<n>Existing methods face several challenges in semantic encoding.<n>We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codecs.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling [6.313337261965531]
We introduce HH-Codec, a neural codecs that achieves extreme compression at 24 tokens per second for 24 kHz audio.<n>Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss.<n> HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps.
arXiv Detail & Related papers (2025-07-25T02:44:30Z)
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate based on focal modulation that utilizes a single binary codebook to compress speech. Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z)
LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec [14.7377193484733]
We propose LSCodec, a discrete speech that has both low and speaker decoupling ability. By reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and smaller vocabulary size than baselines.
arXiv Detail & Related papers (2024-10-21T08:23:31Z)
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [55.95078490630001]
This paper presents FunCodec, a fundamental neural speech toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech models, such as SoundStream and Encodec. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes.
arXiv Detail & Related papers (2023-09-14T03:18:24Z)
RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel representation for semantic speech tokenization. We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework. We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes. The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models. We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words. Our framework is game-theoretic, motivated by generative adversarial networks (GANs)
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s. The proposed model is a modified version of the StyleMelGAN vocoder that can run in frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z)
SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio. SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system. Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame. Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.