Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine
- URL: http://arxiv.org/abs/2507.12701v1
- Date: Thu, 17 Jul 2025 00:32:07 GMT
- Title: Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine
- Authors: Anastasia Kuznetsova, Inseon Jang, Wootaek Lim, Minje Kim
- Abstract summary: This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with minimal loss of downstream model performance.
- Score: 16.046905753937384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with a minimal loss of the downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
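To make the idea concrete, below is a minimal, hypothetical sketch of the training setup: an RVQ bottleneck inserted at an intermediate layer of a frozen downstream model, trained with the task loss plus a codebook loss. The module sizes, the 4x64 codebook configuration, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualVQ(nn.Module):
    """Toy residual vector quantizer: each stage quantizes what the
    previous stages left unexplained."""
    def __init__(self, dim=256, num_codebooks=4, codebook_size=64):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim))
             for _ in range(num_codebooks)])

    def forward(self, x):                       # x: (batch, frames, dim)
        residual, quantized = x, torch.zeros_like(x)
        cb_loss = x.new_zeros(())
        for cb in self.codebooks:
            d = torch.cdist(residual, cb.unsqueeze(0).expand(x.size(0), -1, -1))
            q = cb[d.argmin(dim=-1)]            # nearest code per frame
            cb_loss = cb_loss + F.mse_loss(residual.detach(), q)
            quantized = quantized + q           # task-loss grads reach the picked codes
            residual = residual - q.detach()
        return quantized, cb_loss

# Hypothetical downstream model, already trained, split at the layer to code.
front = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
back = nn.Linear(256, 100)                      # e.g., a 100-class task head
for p in list(front.parameters()) + list(back.parameters()):
    p.requires_grad = False                     # downstream model stays frozen

rvq = ResidualVQ()
opt = torch.optim.Adam(rvq.parameters(), lr=1e-4)

feats = torch.randn(8, 50, 80)                  # dummy batch, 50 frames each
labels = torch.randint(0, 100, (8,))

z = front(feats)                                # intermediate features to code
z_q, cb_loss = rvq(z)
logits = back(z_q).mean(dim=1)                  # pool frames for classification
loss = F.cross_entropy(logits, labels) + cb_loss  # task loss guides the codec
opt.zero_grad(); loss.backward(); opt.step()
```

In a configuration like this, each frame costs num_codebooks x log2(codebook_size) = 24 bits, so the bitrate is governed by the codebook count, codebook size, and frame rate; those are the kinds of knobs that reach the sub-200 bps operating points the abstract mentions.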
Related papers
- Learning to Upsample and Upmix Audio in the Latent Domain [13.82572699087732]
Neural audio autoencoders create compact latent representations that preserve perceptually important information. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space. We demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio.
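As a rough sketch of the mechanism (with stand-in convolutional modules; the real system uses a pretrained neural audio autoencoder and learned latent operators):

```python
import torch
import torch.nn as nn

# Stand-in pretrained autoencoder: 1-D convs that downsample audio 64x.
encoder = nn.Conv1d(1, 64, kernel_size=64, stride=64)
decoder = nn.ConvTranspose1d(64, 1, kernel_size=64, stride=64)

# Small network that does the actual processing (e.g., upmixing) on latents.
latent_op = nn.Conv1d(64, 64, kernel_size=3, padding=1)

audio = torch.randn(1, 1, 16000)      # 1 second at 16 kHz
z = encoder(audio)                    # (1, 64, 250): 64x fewer time steps
z = latent_op(z)                      # cheap: operates on the compact latent
out = decoder(z)                      # back to the waveform domain
print(out.shape)                      # torch.Size([1, 1, 16000])
```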
arXiv Detail & Related papers (2025-05-31T19:27:22Z)
- RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression [68.31184784672227]
In modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks. It is therefore useful to optimize the encoder for a downstream task instead of for image quality. Here, we address this challenge by controlling the Quantization Parameters (QPs) at the macro-block level to optimize the downstream task.
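Purely as an illustration of block-level, task-aware rate control (not the paper's agent), the sketch below samples one QP per macro-block from a tiny policy and applies a REINFORCE update with a reward that trades a dummy task score against a dummy bit estimate; every quantity here is a stand-in.

```python
import torch
import torch.nn as nn

NUM_QPS = 8                                    # discretized QP choices
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, NUM_QPS))

block_feats = torch.randn(120, 32)             # features for 120 macro-blocks
dist = torch.distributions.Categorical(logits=policy(block_feats))
qps = dist.sample()                            # one QP index per block

# Dummy stand-ins: real values would come from the video codec + task model.
bits_per_block = (NUM_QPS - qps).float() * 100.0   # lower QP -> more bits
task_score = torch.rand(())                        # e.g., detection accuracy
reward = task_score - 1e-4 * bits_per_block.sum()

loss = -(dist.log_prob(qps).sum() * reward)    # REINFORCE policy gradient
loss.backward()
```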
arXiv Detail & Related papers (2025-01-21T15:36:08Z)
- Variable Bitrate Residual Vector Quantization for Audio Coding [29.368893236587343]
Recent neural audio compression models have progressively adopted residual vector quantization (RVQ). These models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoffs. We propose variable RVQ (VRVQ) for audio codecs, which allows for more efficient coding by adapting the number of codebooks used per frame.
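A toy rendering of the variable-codebook idea (the real VRVQ learns its allocation; the energy heuristic, shapes, and random codebooks below are invented for illustration):

```python
import math
import torch

num_codebooks, codebook_size, dim = 4, 64, 16
codebooks = [torch.randn(codebook_size, dim) for _ in range(num_codebooks)]

x = torch.randn(100, dim)                      # 100 frames of features
# Crude importance rule: higher-energy frames get more quantization stages.
norms = x.norm(dim=1)
stages = torch.clamp((norms / norms.mean() * 2).long(), 1, num_codebooks)

residual, quantized = x.clone(), torch.zeros_like(x)
for k, cb in enumerate(codebooks):
    active = stages > k                        # frames still being refined
    if not active.any():
        break
    q = cb[torch.cdist(residual[active], cb).argmin(dim=1)]
    quantized[active] += q
    residual[active] -= q

bits = stages.sum().item() * math.log2(codebook_size)
print(f"avg stages/frame: {stages.float().mean():.2f}, total bits: {bits:.0f}")
```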
arXiv Detail & Related papers (2024-10-08T13:18:24Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different training choices, such as quantization schemes and time-domain versus spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
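A minimal sketch of one way to realize such a reduction layer (frame stacking plus a linear projection; the factor and dimensions below are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class FrameReduction(nn.Module):
    def __init__(self, dim=256, factor=8):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(dim * factor, dim)

    def forward(self, x):                    # x: (batch, frames, dim)
        b, t, d = x.shape
        t = t - t % self.factor              # drop any ragged tail frames
        x = x[:, :t].reshape(b, t // self.factor, d * self.factor)
        return self.proj(x)                  # (batch, frames/factor, dim)

enc_out = torch.randn(4, 400, 256)           # e.g., 4 s of 100 Hz encoder frames
reduced = FrameReduction()(enc_out)
print(reduced.shape)                         # torch.Size([4, 50, 256]) -> 12.5 Hz
```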
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
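A hedged sketch of what a multiscale spectrogram adversary can look like: one discriminator scored on STFT magnitudes at several resolutions, here with a least-squares GAN loss. The discriminator architecture and FFT sizes are simplifications, not EnCodec's exact design.

```python
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.LeakyReLU(),
                     nn.Conv2d(16, 1, 3, padding=1))

def spectrograms(wav, ffts=(512, 1024, 2048)):
    specs = []
    for n in ffts:                            # several time-frequency scales
        s = torch.stft(wav, n_fft=n, hop_length=n // 4,
                       window=torch.hann_window(n), return_complex=True)
        specs.append(s.abs().unsqueeze(1))    # (batch, 1, freq, time)
    return specs

real, fake = torch.randn(2, 16000), torch.randn(2, 16000)
d_loss = sum(((disc(s_r) - 1) ** 2).mean() + (disc(s_f) ** 2).mean()
             for s_r, s_f in zip(spectrograms(real), spectrograms(fake)))
print(d_loss.item())
```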
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations of off-the-shelf audio neural networks by more than half.
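The essence fits in a few lines; below is an illustrative average-pooling front-end (the pooling factor and input shape are placeholders):

```python
import torch
import torch.nn.functional as F

mel = torch.randn(8, 1, 64, 1000)                # (batch, 1, mel bins, frames)
pooled = F.avg_pool2d(mel, kernel_size=(1, 2))   # halve the frame rate
print(mel.shape, '->', pooled.shape)             # ... -> torch.Size([8, 1, 64, 500])
# Any off-the-shelf audio classifier now runs on half the input frames,
# roughly halving its floating point operations.
```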
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
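An illustrative-only sketch of lottery-ticket-style iterative magnitude pruning with weight rewinding, the mechanism behind LTH-IF; the model, prune ratio, and number of rounds are dummies, and the actual WWS fine-tuning step is elided.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 2))
init_state = copy.deepcopy(model.state_dict())   # weights to rewind to
masks = {n: torch.ones_like(p, dtype=torch.bool)
         for n, p in model.named_parameters() if p.dim() > 1}

for _ in range(3):                               # a few prune/fine-tune rounds
    # ... fine-tune the masked model on the WWS task here ...
    for n, p in model.named_parameters():
        if n in masks:
            thresh = p[masks[n]].abs().quantile(0.2)  # cut smallest 20% left
            masks[n] &= p.abs() > thresh
    model.load_state_dict(init_state)            # rewind surviving weights
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p *= masks[n]                    # re-apply the sparsity mask

total = sum(m.numel() for m in masks.values())
kept = sum(m.sum().item() for m in masks.values())
print(f"sparsity: {1 - kept / total:.2%}")
```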
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
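To show the core streaming mechanism, here is a small sketch of a time-restricted self-attention mask (the window sizes are illustrative, and the triggered attention on the decoder side is not shown):

```python
import torch

def time_restricted_mask(num_frames, left=20, right=4):
    idx = torch.arange(num_frames)
    rel = idx[None, :] - idx[:, None]          # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)     # True = frame i may attend to j

mask = time_restricted_mask(8, left=2, right=1)
print(mask.int())
# Frame i sees frames i-2 .. i+1, so the encoder's lookahead (and thus its
# streaming latency) is bounded by the right context.
```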
arXiv Detail & Related papers (2020-01-08T18:58:02Z)