Related papers: Neural Speech and Audio Coding: Modern AI Technology Meets Traditional Codecs

URL: http://arxiv.org/abs/2408.06954v2
Date: Tue, 07 Jan 2025 04:11:55 GMT
Title: Neural Speech and Audio Coding: Modern AI Technology Meets Traditional Codecs
Authors: Minje Kim, Jan Skoglund
Abstract summary: The paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It introduces a neural network-based signal enhancer designed to post-process existing codecs' output. The paper examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs.
Score: 19.437080345021105
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based methods. The study presents hybrid systems as a viable solution, offering significant improvements to the performance of conventional codecs through meticulously chosen design enhancements. Specifically, it introduces a neural network-based signal enhancer designed to post-process existing codecs' output, along with the autoencoder-based end-to-end models and LPCNet, hybrid systems that combine linear predictive coding (LPC) with neural networks. Furthermore, the paper delves into predictive models operating within custom feature spaces (TF-Codec) or predefined transform domains (MDCTNet) and examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs. Through these investigations, the paper demonstrates the potential of hybrid systems to advance the field of speech and audio coding by bridging the gap between traditional model-based approaches and modern data-driven techniques.

Related papers

Advances in Intelligent Hearing Aids: Deep Learning Approaches to Selective Noise Cancellation [0.0]
This systematic literature review evaluates advances in AI-driven selective noise cancellation for hearing aids. We synthesize findings across deep learning architectures, hardware deployment strategies, clinical validation studies, and user-centric design. Key findings include significant gains over traditional methods, with recent models achieving up to 18.3 dB SI-SDR improvement on noisy-reverberant benchmarks.
arXiv Detail & Related papers (2025-06-25T15:05:16Z)
Towards Generalized Source Tracing for Codec-Based Deepfake Speech [52.68106957822706]
We introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
arXiv Detail & Related papers (2025-06-08T21:36:10Z)
Efficient Evaluation of Quantization-Effects in Neural Codecs [4.897318643396687]
Training neural codecs requires techniques to allow a non-zero gradient across the quantizer. This paper proposes an efficient evaluation framework for neural codecs using simulated data. We validate our findings against an internal neural audio codec and against the state-of-the-art descript-audio-codec.
arXiv Detail & Related papers (2025-02-07T09:11:19Z)
Quantum-Trained Convolutional Neural Network for Deepfake Audio Detection [3.2927352068925444]
Deepfake technologies pose challenges to privacy, security, and information integrity. This paper introduces a Quantum-Trained Convolutional Neural Network framework designed to enhance the detection of deepfake audio.
arXiv Detail & Related papers (2024-10-11T20:52:10Z)
SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark. It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Erasure Coded Neural Network Inference via Fisher Averaging [28.243239815823205]
Erasure-coded computing has been successfully used in cloud systems to reduce tail latency caused by factors such as straggling servers and heterogeneous traffic variations. We design a method to code over neural networks, that is, given two or more neural network models, how to construct a coded model whose output is a linear combination of the outputs of the given neural networks. We conduct experiments to perform erasure coding over neural networks trained on real-world vision datasets and show that the accuracy of the decoded outputs using COIN is significantly higher than other baselines.
arXiv Detail & Related papers (2024-09-02T18:46:26Z)
Neural Harmonium: An Interpretable Deep Structure for Nonlinear Dynamic System Identification with Application to Audio Processing [4.599180419117645]
Interpretability helps us understand a model's ability to generalize and reveal its limitations. We introduce a causal interpretable deep structure for modeling dynamic systems. Our proposed model makes use of the harmonic analysis by modeling the system in a time-frequency domain.
arXiv Detail & Related papers (2023-10-10T21:32:15Z)
Channelformer: Attention based Neural Solution for Wireless Channel Estimation and Effective Online Training [1.0499453838486013]
We propose an encoder-decoder neural architecture (called Channelformer) to achieve improved channel estimation. We employ multi-head attention in the encoder and a residual convolutional neural architecture as the decoder. We also propose an effective online training method based on the fifth generation (5G) new radio (NR) configuration for the modern communication systems.
arXiv Detail & Related papers (2023-02-08T23:18:23Z)
Ultrasound Signal Processing: From Models to Deep Learning [64.56774869055826]
Medical ultrasound imaging relies heavily on high-quality signal processing to provide reliable and interpretable image reconstructions. Deep learning based methods, which are optimized in a data-driven fashion, have gained popularity. A relatively new paradigm combines the power of the two: leveraging data-driven deep learning, as well as exploiting domain knowledge.
arXiv Detail & Related papers (2022-04-09T13:04:36Z)
Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image. The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z)
A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate the design of a compact audio-visual wake word spotting (WWS) system by utilizing visual information. We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF). The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
Learn to Communicate with Neural Calibration: Scalability and Generalization [10.775558382613077]
We propose a scalable and generalizable neural calibration framework for future wireless system design. The proposed neural calibration framework is applied to solve challenging resource management problems in massive multiple-input multiple-output (MIMO) systems.
arXiv Detail & Related papers (2021-10-01T09:00:25Z)
Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction. It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition. We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
Supervised DKRC with Images for Offline System Identification [77.34726150561087]
Modern dynamical systems are becoming increasingly non-linear and complex. There is a need for a framework to model these systems in a compact and comprehensive representation for prediction and control. Our approach learns these basis functions using a supervised learning approach.
arXiv Detail & Related papers (2021-09-06T04:39:06Z)
Model-Based Deep Learning [155.063817656602]
Signal processing, communications, and control have traditionally relied on classical statistical modeling techniques. Deep neural networks (DNNs) use generic architectures which learn to operate from data, and demonstrate excellent performance. We are interested in hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches.
arXiv Detail & Related papers (2020-12-15T16:29:49Z)
AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.