Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
- URL: http://arxiv.org/abs/2207.02262v1
- Date: Tue, 5 Jul 2022 18:52:11 GMT
- Title: Ultra-Low-Bitrate Speech Coding with Pretrained Transformers
- Authors: Ali Siahkoohi and Michael Chinen and Tom Denton and W. Bastiaan Kleijn
and Jan Skoglund
- Abstract summary: Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion.
We reduce the bitrate of neural speech codecs further by using pretrained Transformers, which can exploit long-range dependencies in the input signal due to their inductive bias.
- Score: 28.400364949575103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech coding facilitates the transmission of speech over low-bandwidth
networks with minimal distortion. Neural-network based speech codecs have
recently demonstrated significant improvements in quality over traditional
approaches. While this new generation of codecs is capable of synthesizing
high-fidelity speech, their use of recurrent or convolutional layers often
restricts their effective receptive fields, which prevents them from
compressing speech efficiently. We propose to further reduce the bitrate of
neural speech codecs through the use of pretrained Transformers, capable of
exploiting long-range dependencies in the input signal due to their inductive
bias. As such, we use a pretrained Transformer in tandem with a convolutional
encoder, which is trained end-to-end with a quantizer and a generative
adversarial net decoder. Our numerical experiments show that supplementing the
convolutional encoder of a neural speech codec with Transformer speech
embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that
outperforms the original neural speech codec in synthesized speech quality when
trained at the same bitrate. Subjective human evaluations suggest that the
quality of the resulting codec is comparable to or better than that of
conventional codecs operating at three to four times the rate.
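The bitrate figures in the abstract can be made concrete with a little arithmetic. The sketch below assumes a fixed-rate vector quantizer with no entropy coding, so that bitrate equals frame rate times bits per frame. The function name `quantizer_bitrate_bps` and the frame rate, codebook count, and codebook size are hypothetical values chosen only to land on 600 bps; they are not taken from the paper. The only numbers that come from the abstract are the 600 bps rate and the "three to four times" comparison.

```python
import math


def quantizer_bitrate_bps(frames_per_second, num_codebooks, codebook_size):
    """Bitrate in bits per second of a fixed-rate vector quantizer.

    Assumes every frame is coded with `num_codebooks` indices, each costing
    log2(codebook_size) bits, with no entropy coding on top.
    """
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


# Hypothetical configuration (NOT from the paper): 50 frames per second,
# 2 codebooks of 64 entries each, i.e. 12 bits per frame.
example_bps = quantizer_bitrate_bps(
    frames_per_second=50, num_codebooks=2, codebook_size=64
)

# From the abstract: the proposed codec runs at 600 bps and is compared with
# conventional codecs operating at three to four times that rate.
proposed_bps = 600
comparison_range_bps = (3 * proposed_bps, 4 * proposed_bps)

print(example_bps, comparison_range_bps)
```

Under these assumptions, the conventional codecs in the comparison would sit at roughly 1800 to 2400 bps.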
Related papers
- LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec [14.7377193484733]
We propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability.
In reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines.
arXiv Detail & Related papers (2024-10-21T08:23:31Z)
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec [55.95078490630001]
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec.
Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes.
arXiv Detail & Related papers (2023-09-14T03:18:24Z)
- RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate [8.312162364318235]
We present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s.
The proposed model is a modified version of the StyleMelGAN vocoder that can run in frame-by-frame manner.
arXiv Detail & Related papers (2021-08-09T14:03:07Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
- Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system.
Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame.
Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)