Psychoacoustic Calibration of Loss Functions for Efficient End-to-End
Neural Audio Coding
- URL: http://arxiv.org/abs/2101.00054v1
- Date: Thu, 31 Dec 2020 19:46:46 GMT
- Title: Psychoacoustic Calibration of Loss Functions for Efficient End-to-End
Neural Audio Coding
- Authors: Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, Minje Kim
- Abstract summary: We present a psychoacoustic calibration scheme to re-define the loss functions of neural audio coding systems.
With the proposed method, a lightweight neural codec, with only 0.9 million parameters, performs near-transparent audio coding comparable with the commercial MPEG-1 Audio Layer III codec at 112 kbps.
- Score: 30.307627653506756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional audio coding technologies commonly leverage human perception of
sound, or psychoacoustics, to reduce the bitrate while preserving the
perceptual quality of the decoded audio signals. For neural audio codecs,
however, the objective nature of the loss function usually leads to suboptimal
sound quality as well as high run-time complexity due to the large model size.
In this work, we present a psychoacoustic calibration scheme to re-define the
loss functions of neural audio coding systems so that they can decode signals
more perceptually similar to the reference, yet with a much lower model
complexity. The proposed loss function incorporates the global masking
threshold, tolerating reconstruction error that corresponds to inaudible
artifacts. Experimental results show that the proposed model outperforms a
baseline neural codec that is twice as large and consumes 23.4% more bits per second.
With the proposed method, a lightweight neural codec, with only 0.9 million
parameters, performs near-transparent audio coding comparable with the
commercial MPEG-1 Audio Layer III codec at 112 kbps.
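The idea of a loss that tolerates inaudible error can be pictured with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the abstract: the weighting formula `log10(1 + signal/mask)`, the function names, and the toy numbers. The global masking threshold is assumed to come from a separate psychoacoustic model that is not shown.

```python
import math

def masking_weights(signal_psd_db, mask_db):
    """Per-bin weights log10(1 + signal/mask), with both spectra given in dB.

    Bins whose signal power sits far below the masking threshold get a
    weight near 0; bins well above it get a larger weight.
    """
    weights = []
    for p_db, m_db in zip(signal_psd_db, mask_db):
        signal_to_mask = 10.0 ** (0.1 * (p_db - m_db))  # linear power ratio
        weights.append(math.log10(1.0 + signal_to_mask))
    return weights

def weighted_spectral_loss(ref_mag, dec_mag, weights):
    """Mean of the weighted squared error between two magnitude spectra."""
    total = sum(w * (r - d) ** 2 for w, r, d in zip(weights, ref_mag, dec_mag))
    return total / len(ref_mag)

# Toy two-bin frame (made-up numbers): bin 0 is 20 dB above the mask,
# bin 1 is 30 dB below it.
weights = masking_weights([60.0, 20.0], [40.0, 50.0])
ref = [1.0, 1.0]
loss_audible = weighted_spectral_loss(ref, [0.9, 1.0], weights)  # error in bin 0
loss_masked = weighted_spectral_loss(ref, [1.0, 0.9], weights)   # error in bin 1
print(loss_audible > loss_masked)
```

In this toy frame the same magnitude error costs far more when it falls in the bin above the masking threshold than in the masked bin, which is the behavior a psychoacoustically calibrated loss aims for.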
Related papers
- High-Fidelity Music Vocoder using Neural Audio Codecs [18.95453617434051]
DisCoder is a neural vocoder that reconstructs high-fidelity 44.1 kHz audio from mel spectrograms.
DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study.
Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
arXiv Detail & Related papers (2025-02-18T11:25:46Z)
- Efficient Evaluation of Quantization-Effects in Neural Codecs [4.897318643396687]
Training neural codecs requires techniques to allow a non-zero gradient across the quantizer.
This paper proposes an efficient evaluation framework for neural codecs using simulated data.
We validate our findings against an internal neural audio codec and against the state-of-the-art descript-audio-codec.
arXiv Detail & Related papers (2025-02-07T09:11:19Z)
- The END: An Equivariant Neural Decoder for Quantum Error Correction [73.4384623973809]
We introduce a data-efficient neural decoder that exploits the symmetries of the problem.
We propose a novel equivariant architecture that achieves state-of-the-art accuracy compared to previous neural decoders.
arXiv Detail & Related papers (2023-04-14T19:46:39Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- Improved decoding of circuit noise and fragile boundaries of tailored surface codes [61.411482146110984]
We introduce decoders that are both fast and accurate, and can be used with a wide class of quantum error correction codes.
Our decoders, named belief-matching and belief-find, exploit all noise information and thereby unlock higher accuracy demonstrations of QEC.
We find that the decoders led to a much higher threshold and lower qubit overhead in the tailored surface code with respect to the standard, square surface code.
arXiv Detail & Related papers (2022-03-09T18:48:54Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)