Latent-Domain Predictive Neural Speech Coding
- URL: http://arxiv.org/abs/2207.08363v2
- Date: Thu, 25 May 2023 12:59:19 GMT
- Title: Latent-Domain Predictive Neural Speech Coding
- Authors: Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu
- Abstract summary: This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
- Score: 22.65761249591267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio/speech coding has recently demonstrated its capability to
deliver high quality at much lower bitrates than traditional methods. However,
existing neural audio/speech codecs employ either acoustic features or blind
features learned with a convolutional neural network for encoding, which leaves
temporal redundancies within the encoded features. This paper introduces
latent-domain predictive coding into the VQ-VAE framework to fully remove such
redundancies and proposes the TF-Codec for low-latency neural speech coding in
an end-to-end manner. Specifically, the extracted features are encoded
conditioned on a prediction from past quantized latent frames so that temporal
correlations are further removed. Moreover, we introduce a learnable
compression on the time-frequency input to adaptively adjust the attention paid
to main frequencies and details at different bitrates. A differentiable vector
quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is
proposed to better model the latent distributions with rate constraint.
Subjective results on multilingual speech datasets show that, with low latency,
the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus
at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at
12 kbps. Numerous studies are conducted to demonstrate the effectiveness of
these techniques.
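No reference code accompanies this listing, so the following is a minimal PyTorch sketch of the two ideas the abstract names: encoding each latent frame conditioned on a causal prediction from past quantized frames (simplified here to residual coding), and a differentiable quantizer that maps negative distances to logits and samples with Gumbel-Softmax. The module names, the GRU predictor, and all sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictiveVQ(nn.Module):
    """Toy latent-domain predictive coder: each latent frame is encoded
    conditioned on a prediction from past *quantized* frames, and the
    residual is quantized with a distance-to-soft Gumbel-Softmax VQ."""

    def __init__(self, dim=64, num_codes=256):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.predictor = nn.GRU(dim, dim, batch_first=True)  # causal predictor

    def quantize(self, z, tau=0.5):
        # Distance-to-soft mapping: negative squared distance as logits,
        # then Gumbel-Softmax for a differentiable (near one-hot) choice.
        cb = self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)
        logits = -torch.cdist(z, cb) ** 2
        weights = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        return weights @ self.codebook

    def forward(self, z):                       # z: (B, T, D) encoder latents
        B, T, D = z.shape
        h, prev = None, torch.zeros(B, 1, D, device=z.device)
        out = []
        for t in range(T):
            pred, h = self.predictor(prev, h)          # predict from past quantized frames
            res_q = self.quantize(z[:, t:t+1] - pred)  # code only the residual
            prev = pred + res_q                        # reconstructed latent frame
            out.append(prev)
        return torch.cat(out, dim=1)
```

For example, `LatentPredictiveVQ()(torch.randn(2, 50, 64))` round-trips a batch of 50-frame latents; at inference only the per-frame code indices would need to be transmitted, since the predictor runs identically on both sides.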
Related papers
- Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding [24.472393096460774]
We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training.
Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads.
In experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models.
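As a rough sketch of that core idea (not the authors' code; the hidden size, vocabulary, and head count below are invented), several output heads can share one decoder state so that a single AR step emits several codec tokens:

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """K linear heads over one shared AR decoder state: head i drafts
    token t+i, so each decoder step yields K tokens instead of one."""

    def __init__(self, hidden=512, vocab=1024, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(k))

    def forward(self, h):                       # h: (B, hidden) decoder state
        return torch.stack([head(h).argmax(-1) for head in self.heads], dim=1)  # (B, K)
```

Speculative decoding would then verify these drafted tokens against the full model, which is how the speed/quality trade-off stays adjustable at inference time.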
arXiv Detail & Related papers (2024-10-17T17:55:26Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
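A minimal sketch of what a multiscale spectrogram adversary could look like (the FFT sizes and conv stack are assumptions; the paper's discriminator is not reproduced here):

```python
import torch
import torch.nn as nn

class MultiScaleSpectrogramDiscriminator(nn.Module):
    """Toy multiscale spectrogram adversary: one small conv stack per STFT
    resolution; real/fake scores are averaged across the scales."""

    def __init__(self, fft_sizes=(512, 1024, 2048)):
        super().__init__()
        self.fft_sizes = fft_sizes
        self.discs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(16, 1, 3, padding=1),
            )
            for _ in fft_sizes
        )

    def forward(self, wav):                      # wav: (B, T) waveform
        scores = []
        for n_fft, disc in zip(self.fft_sizes, self.discs):
            spec = torch.stft(wav, n_fft, hop_length=n_fft // 4,
                              window=torch.hann_window(n_fft, device=wav.device),
                              return_complex=True).abs()
            scores.append(disc(spec.unsqueeze(1)).mean())  # score this scale
        return torch.stack(scores).mean()        # averaged adversary score
```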
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z) - Cross-Scale Vector Quantization for Scalable Neural Speech Coding [22.65761249591267]
Bitrate scalability is a desirable feature for audio coding in real-time communications.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ).
In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and quality progressively improves as more bits become available.
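As a toy illustration of that scalability property (a simplification: CSVQ fuses multi-scale features in the decoder, while this sketch uses plain residual stages with invented names):

```python
import torch

def progressive_decode(stage_codes, codebooks, num_received):
    """Toy scalable decoding: stage i quantizes what stages 0..i-1 left
    over, so summing however many stages actually arrived yields a coarse
    signal that improves as more of the bitstream is received."""
    z = torch.zeros_like(codebooks[0][stage_codes[0]])    # (T, D) output
    for codes, cb in list(zip(stage_codes, codebooks))[:num_received]:
        z = z + cb[codes]                   # add this stage's refinement
    return z
```

With `codebooks = [torch.randn(256, 64) for _ in range(4)]` and matching index tensors, `num_received=1` gives the coarse base layer and `num_received=4` the full-quality reconstruction.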
arXiv Detail & Related papers (2022-07-07T03:23:25Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is the task of increasing the speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolutions and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
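A minimal sketch of the residual vector quantizer idea (greedy nearest-neighbor stages; the codebook sizes and shapes are illustrative, not SoundStream's):

```python
import torch

def residual_vq(z, codebooks):
    """Greedy residual VQ: each codebook quantizes what the previous
    stages left over. Returns the quantized latents plus one index
    tensor per stage (together forming the bitstream)."""
    residual, quantized, indices = z, torch.zeros_like(z), []
    for cb in codebooks:                             # cb: (K, D), z: (T, D)
        idx = torch.cdist(residual, cb).argmin(-1)   # nearest codeword per frame
        q = cb[idx]
        quantized, residual = quantized + q, residual - q
        indices.append(idx)
    return quantized, indices
```

Dropping trailing stages of `indices` is also what makes such a quantizer naturally bitrate-scalable.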
arXiv Detail & Related papers (2021-07-07T15:45:42Z) - Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based Approach [4.02517560480215]
Traditional low bit-rate speech coding approaches only handle narrowband speech at 8 kHz.
This paper presents a new approach through vector quantization (VQ) of mel-frequency cepstral coefficients (MFCCs).
It provides better speech quality compared with the state-of-the-art classic MELPe at a lower bit-rate.
arXiv Detail & Related papers (2021-02-04T14:37:16Z) - Enhancement Of Coded Speech Using a Mask-Based Post-Filter [9.324642081509754]
A data-driven post-filter relying on masking in the time-frequency domain is proposed.
A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin.
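A small sketch of how such a mask could be applied (the `mask_net` mapping magnitudes to gains is a placeholder; the paper's three network variants are listed above but not reproduced):

```python
import torch

def apply_tf_mask(coded_wav, mask_net, n_fft=512):
    """Toy mask-based post-filter: estimate one real-valued gain per
    time-frequency bin of the coded speech, scale the spectrum by it
    (keeping the coded phase), and resynthesize the waveform."""
    win = torch.hann_window(n_fft, device=coded_wav.device)
    spec = torch.stft(coded_wav, n_fft, window=win, return_complex=True)
    mask = mask_net(spec.abs())            # (B, F, T) gains, e.g. in [0, 1]
    return torch.istft(spec * mask, n_fft, window=win)
```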
arXiv Detail & Related papers (2020-10-12T09:48:09Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.