A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT
Domain
- URL: http://arxiv.org/abs/2201.12039v1
- Date: Fri, 28 Jan 2022 11:08:02 GMT
- Authors: Kishan Gupta, Srikanth Korse, Bernd Edler, Guillaume Fuchs
- Abstract summary: We propose a mask-based post-filter operating directly in MDCT domain, inducing no extra delay.
The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network.
Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at its lowest possible bitrate of 16 kbps.
- Score: 16.70806998451696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frequency domain processing, and in particular the use of Modified Discrete
Cosine Transform (MDCT), is the most widespread approach to audio coding.
However, at low bitrates, audio quality, especially for speech, degrades
drastically due to the lack of available bits to directly code the transform
coefficients. Traditionally, post-filtering has been used to mitigate artefacts
in the coded speech by exploiting a-priori information of the source and extra
transmitted parameters. Recently, data-driven post-filters have shown better
results, but at the cost of significant additional complexity and delay. In
this work, we propose a mask-based post-filter operating directly in MDCT
domain of the codec, inducing no extra delay. The real-valued mask is applied
to the quantized MDCT coefficients and is estimated from a relatively
lightweight convolutional encoder-decoder network. Our solution is tested on
the recently standardized low-delay, low-complexity codec (LC3) at its lowest
possible bitrate of 16 kbps. Objective and subjective assessments clearly show
the advantage of this approach over the conventional post-filter, with an
average improvement of 10 MUSHRA points over the LC3 coded speech.
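The central mechanism, a real-valued mask applied element-wise to the quantized MDCT coefficients, can be illustrated with a toy sketch in which an oracle mask stands in for the convolutional encoder-decoder network. All sizes and the quantizer step below are made up for illustration; this is not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one frame of MDCT coefficients (a real codec would
# produce these from windowed, transformed audio).
clean = rng.standard_normal(320)           # "original" MDCT coefficients
step = 0.5                                 # coarse quantizer step (low bitrate)
quantized = step * np.round(clean / step)  # decoded (quantized) coefficients

# Oracle real-valued mask per coefficient, standing in for the output of
# the lightweight convolutional encoder-decoder network.
eps = 1e-8
mask = np.clip(np.abs(clean) / (np.abs(quantized) + eps), 0.0, 2.0)

# Post-filtering: element-wise mask applied to the quantized MDCT bins,
# keeping the sign of the decoded coefficients.
enhanced = mask * quantized

def snr(ref, est):
    """Signal-to-noise ratio in dB of an estimate against a reference."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))
```

With the oracle mask, coefficients that survived quantization recover their original magnitude, so `snr(clean, enhanced)` exceeds `snr(clean, quantized)`; in the paper the mask is of course estimated, not oracle.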
Related papers
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
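As a rough illustration of why a coordinate encoding helps an MLP capture high-frequency detail, here is a simple sinusoidal frequency encoding of 3D coordinates. This is a hypothetical stand-in, not NAF's actual learned hash encoder.

```python
import numpy as np

def encode(xyz, n_freqs=4):
    # Map raw 3D coordinates to [xyz, sin/cos at octave-spaced frequencies].
    # Higher-frequency features let a downstream MLP represent fine spatial
    # variation in the attenuation field.
    feats = [xyz]
    for i in range(n_freqs):
        feats.append(np.sin(2.0 ** i * np.pi * xyz))
        feats.append(np.cos(2.0 ** i * np.pi * xyz))
    return np.concatenate(feats, axis=-1)

pts = np.zeros((2, 3))          # two toy 3D query points
print(encode(pts).shape)        # 3 + 4 * 2 * 3 = 27 features per point
```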
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- Denoising Diffusion Error Correction Codes [92.10654749898927]
Recently, neural decoders have demonstrated their advantage over classical decoding techniques.
Recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders.
We propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths.
arXiv Detail & Related papers (2022-09-16T11:00:50Z)
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
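The latent-domain quantization at the heart of such VQ-VAE style codecs reduces, per frame, to a nearest-neighbour lookup in a learned codebook. A toy sketch with made-up codebook and latent sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a codebook of 16 latent vectors of dimension 4 and
# encoder outputs for 8 frames (in a real codec both are learned).
codebook = rng.standard_normal((16, 4))
latents = rng.standard_normal((8, 4))

# Quantize each latent to its nearest codebook entry (squared Euclidean
# distance); only the 4-bit index per frame would need to be transmitted.
d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = d2.argmin(axis=1)
quantized = codebook[indices]

print(indices.shape, quantized.shape)   # (8,) (8, 4)
```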
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
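The high-ratio random masking step can be sketched in a few lines: shuffle the patch tokens, keep a small visible subset for the encoder, and remember the kept positions so the decoder can later re-insert mask tokens. Sizes below are toy values, not Audio-MAE's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy spectrogram split into 64 patch embeddings of dim 32.
patches = rng.standard_normal((64, 32))
mask_ratio = 0.8                                   # high masking ratio
num_keep = int(round(len(patches) * (1 - mask_ratio)))

# Random permutation; only the first num_keep patches pass through the
# encoder, and keep_idx records where they came from in the input.
perm = rng.permutation(len(patches))
keep_idx = np.sort(perm[:num_keep])
visible = patches[keep_idx]

print(visible.shape)   # roughly 20% of the tokens reach the encoder
```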
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- Improved decoding of circuit noise and fragile boundaries of tailored surface codes [61.411482146110984]
We introduce decoders that are both fast and accurate, and can be used with a wide class of quantum error correction codes.
Our decoders, named belief-matching and belief-find, exploit all noise information and thereby unlock higher accuracy demonstrations of QEC.
We find that the decoders led to a much higher threshold and lower qubit overhead in the tailored surface code with respect to the standard, square surface code.
arXiv Detail & Related papers (2022-03-09T18:48:54Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z)
- Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to other autoregressive decoder-based neural speech codecs, our decoder has a significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z)
- Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based Approach [4.02517560480215]
Traditional low bit-rate speech coding approaches only handle narrowband speech sampled at 8 kHz.
This paper presents a new approach through vector quantization (VQ) of mel-frequency cepstral coefficients (MFCCs).
It provides better speech quality compared with the state-of-the-art classic MELPe codec at lower bit-rates.
arXiv Detail & Related papers (2021-02-04T14:37:16Z) - Enhancement Of Coded Speech Using a Mask-Based Post-Filter [9.324642081509754]
A data-driven post-filter relying on masking in the time-frequency domain is proposed.
A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin.
arXiv Detail & Related papers (2020-10-12T09:48:09Z) - Optimization of data-driven filterbank for automatic speaker
verification [8.175789701289512]
We propose a new data-driven filter design method which optimizes filter parameters from given speech data.
The main advantage of the proposed method is that it requires only a very limited amount of unlabeled speech data.
We show that the acoustic features created with proposed filterbank are better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases.
arXiv Detail & Related papers (2020-07-21T11:42:20Z)
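For context, the fixed mel filterbank that such data-driven designs learn to replace can be built in a few lines. The sketch below uses toy parameters (6 filters, a 64-point FFT, 8 kHz sampling); it shows the hand-crafted baseline, not the paper's optimized filterbank.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=6, n_fft=64, sr=8000):
    # Triangular filters whose centres are spaced uniformly on the mel
    # scale: the fixed design that data-driven methods replace with
    # parameters optimized from speech data.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)   # (6, 33): one row per filter, one column per FFT bin
```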
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.