A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT
Domain
- URL: http://arxiv.org/abs/2201.12039v1
- Date: Fri, 28 Jan 2022 11:08:02 GMT
- Authors: Kishan Gupta, Srikanth Korse, Bernd Edler, Guillaume Fuchs
- Abstract summary: We propose a mask-based post-filter operating directly in MDCT domain, inducing no extra delay.
The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network.
Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at its lowest possible bitrate of 16 kbps.
- Score: 16.70806998451696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frequency domain processing, and in particular the use of Modified Discrete
Cosine Transform (MDCT), is the most widespread approach to audio coding.
However, at low bitrates, audio quality, especially for speech, degrades
drastically due to the lack of available bits to directly code the transform
coefficients. Traditionally, post-filtering has been used to mitigate artefacts
in the coded speech by exploiting a-priori information of the source and extra
transmitted parameters. Recently, data-driven post-filters have shown better
results, but at the cost of significant additional complexity and delay. In
this work, we propose a mask-based post-filter operating directly in MDCT
domain of the codec, inducing no extra delay. The real-valued mask is applied
to the quantized MDCT coefficients and is estimated from a relatively
lightweight convolutional encoder-decoder network. Our solution is tested on
the recently standardized low-delay, low-complexity codec (LC3) at its lowest
possible bitrate of 16 kbps. Objective and subjective assessments clearly show
the advantage of this approach over the conventional post-filter, with an
average improvement of 10 MUSHRA points over the LC3 coded speech.
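The central mechanism, a real-valued mask applied element-wise to the quantized MDCT coefficients, can be illustrated with a toy sketch in which an oracle mask stands in for the convolutional encoder-decoder network. All sizes and the quantizer step below are made up for illustration; this is not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one frame of MDCT coefficients (a real codec would
# produce these from windowed, transformed audio).
clean = rng.standard_normal(320)           # "original" MDCT coefficients
step = 0.5                                 # coarse quantizer step (low bitrate)
quantized = step * np.round(clean / step)  # decoded (quantized) coefficients

# Oracle real-valued mask per coefficient, standing in for the output of
# the lightweight convolutional encoder-decoder network.
eps = 1e-8
mask = np.clip(np.abs(clean) / (np.abs(quantized) + eps), 0.0, 2.0)

# Post-filtering: element-wise mask applied to the quantized MDCT bins,
# keeping the sign of the decoded coefficients.
enhanced = mask * quantized

def snr(ref, est):
    """Signal-to-noise ratio in dB of an estimate against a reference."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))
```

With the oracle mask, coefficients that survived quantization recover their original magnitude, so `snr(clean, enhanced)` exceeds `snr(clean, quantized)`; in the paper the mask is of course estimated, not oracle.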
Related papers
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
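As a rough illustration of why a coordinate encoding helps an MLP capture high-frequency detail, here is a simple sinusoidal frequency encoding of 3D coordinates. This is a hypothetical stand-in, not NAF's actual learned hash encoder.

```python
import numpy as np

def encode(xyz, n_freqs=4):
    # Map raw 3D coordinates to [xyz, sin/cos at octave-spaced frequencies].
    # Higher-frequency features let a downstream MLP represent fine spatial
    # variation in the attenuation field.
    feats = [xyz]
    for i in range(n_freqs):
        feats.append(np.sin(2.0 ** i * np.pi * xyz))
        feats.append(np.cos(2.0 ** i * np.pi * xyz))
    return np.concatenate(feats, axis=-1)

pts = np.zeros((2, 3))          # two toy 3D query points
print(encode(pts).shape)        # 3 + 4 * 2 * 3 = 27 features per point
```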
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- Denoising Diffusion Error Correction Codes [92.10654749898927]
Recently, neural decoders have demonstrated their advantage over classical decoding techniques.
Recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders.
We propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths.
arXiv Detail & Related papers (2022-09-16T11:00:50Z)
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
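The latent-domain quantization at the heart of such VQ-VAE style codecs reduces, per frame, to a nearest-neighbour lookup in a learned codebook. A toy sketch with made-up codebook and latent sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a codebook of 16 latent vectors of dimension 4 and
# encoder outputs for 8 frames (in a real codec both are learned).
codebook = rng.standard_normal((16, 4))
latents = rng.standard_normal((8, 4))

# Quantize each latent to its nearest codebook entry (squared Euclidean
# distance); only the 4-bit index per frame would need to be transmitted.
d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = d2.argmin(axis=1)
quantized = codebook[indices]

print(indices.shape, quantized.shape)   # (8,) (8, 4)
```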
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
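The high-ratio random masking step can be sketched in a few lines: shuffle the patch tokens, keep a small visible subset for the encoder, and remember the kept positions so the decoder can later re-insert mask tokens. Sizes below are toy values, not Audio-MAE's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy spectrogram split into 64 patch embeddings of dim 32.
patches = rng.standard_normal((64, 32))
mask_ratio = 0.8                                   # high masking ratio
num_keep = int(round(len(patches) * (1 - mask_ratio)))

# Random permutation; only the first num_keep patches pass through the
# encoder, and keep_idx records where they came from in the input.
perm = rng.permutation(len(patches))
keep_idx = np.sort(perm[:num_keep])
visible = patches[keep_idx]

print(visible.shape)   # roughly 20% of the tokens reach the encoder
```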
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- Improved decoding of circuit noise and fragile boundaries of tailored surface codes [61.411482146110984]
We introduce decoders that are both fast and accurate, and can be used with a wide class of quantum error correction codes.
Our decoders, named belief-matching and belief-find, exploit all noise information and thereby unlock higher accuracy demonstrations of QEC.
We find that the decoders led to a much higher threshold and lower qubit overhead in the tailored surface code with respect to the standard, square surface code.
arXiv Detail & Related papers (2022-03-09T18:48:54Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z)
- Scalable and Efficient Neural Speech Coding [24.959825692325445]
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression.
The proposed CNN autoencoder also defines quantization and coding as a trainable module.
Compared to other autoregressive decoder-based neural speech codecs, our decoder has a significantly smaller architecture.
arXiv Detail & Related papers (2021-03-27T00:10:16Z)
- Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based Approach [4.02517560480215]
Traditional low bit-rate speech coding approaches only handle narrowband speech sampled at 8 kHz.
This paper presents a new approach through vector quantization (VQ) of mel-frequency cepstral coefficients (MFCCs).
It provides better speech quality compared with the state-of-the-art classic MELPe codec at lower bit-rates.
arXiv Detail & Related papers (2021-02-04T14:37:16Z) - Enhancement Of Coded Speech Using a Mask-Based Post-Filter [9.324642081509754]
A data-driven post-filter relying on masking in the time-frequency domain is proposed.
A fully connected neural network (FCNN), a convolutional encoder-decoder (CED) network and a long short-term memory (LSTM) network are implemented to estimate a real-valued mask per time-frequency bin.
arXiv Detail & Related papers (2020-10-12T09:48:09Z) - Optimization of data-driven filterbank for automatic speaker
verification [8.175789701289512]
We propose a new data-driven filter design method which optimizes filter parameters from given speech data.
The main advantage of the proposed method is that it requires only a very limited amount of unlabeled speech data.
We show that the acoustic features created with proposed filterbank are better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases.
arXiv Detail & Related papers (2020-07-21T11:42:20Z)
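For context, the fixed mel filterbank that such data-driven designs learn to replace can be built in a few lines. The sketch below uses toy parameters (6 filters, a 64-point FFT, 8 kHz sampling); it shows the hand-crafted baseline, not the paper's optimized filterbank.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=6, n_fft=64, sr=8000):
    # Triangular filters whose centres are spaced uniformly on the mel
    # scale: the fixed design that data-driven methods replace with
    # parameters optimized from speech data.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)   # (6, 33): one row per filter, one column per FFT bin
```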
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.