Keep what you need : extracting efficient subnetworks from large audio representation models
- URL: http://arxiv.org/abs/2502.12925v1
- Date: Tue, 18 Feb 2025 15:04:33 GMT
- Title: Keep what you need : extracting efficient subnetworks from large audio representation models
- Authors: David Genova, Philippe Esling, Tom Hurlin
- Abstract summary: We introduce learnable binary masks in-between the layers of a pretrained representation model.
When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective.
Once trained, the masked computational units can then be removed from the network, implying significant performance gains.
- Score: 0.8798470556253869
- Abstract: Recently, research on audio foundation models has witnessed notable advances, as illustrated by the ever-improving results on complex downstream tasks. Subsequently, those pretrained networks have quickly been used for various audio applications. These improvements have, however, resulted in a considerable increase in both the size and the complexity of these models. Along with the environmental concerns this issue raises, it prevents the deployment of such networks on consumer-level devices and precludes their use for real-time applications. Moreover, this appears contradictory with the specificity of the tasks for which these models are used, which are often simpler than extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple yet effective method to extract lightweight specialist subnetworks from large foundation models. Specifically, we introduce learnable binary masks between the layers of a pretrained representation model. When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective, hence learning a compact subnetwork specialized on a single task. Importantly, the weights of the foundation model are kept frozen, resulting in low additional training costs. Once trained, the masked computational units can then be removed from the network, implying significant performance gains. We assess our method on three widespread audio foundation models, each based on a different backbone architecture, and illustrate its effectiveness on common audio representation evaluation tasks, as well as its versatility across speech, music, and general audio. Code for reproducing the results and a supporting webpage are available at https://github.com/gnvIRCAM/Audio-representation-trimming
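The masking-and-removal idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the sigmoid-plus-threshold mask, the mean-probability sparsity penalty, and every function name below are assumptions made for illustration, and the training loop that would learn the mask logits (with the backbone weights frozen) is omitted. The sketch only shows that, once a binary mask is fixed, the masked hidden units can be physically removed without changing the output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hard_mask(logits, threshold=0.5):
    # Assumed parameterization: one learnable logit per hidden unit,
    # binarized by thresholding its sigmoid probability.
    return [1.0 if sigmoid(l) > threshold else 0.0 for l in logits]

def sparsity_penalty(logits):
    # Assumed sparsity-inducing term: mean "keep" probability.
    # During training it would be added to the downstream task loss.
    return sum(sigmoid(l) for l in logits) / len(logits)

def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def masked_forward(w1, b1, w2, b2, mask, x):
    # Frozen two-layer network with a binary mask between the layers.
    hidden = [max(0.0, v) for v in linear(w1, b1, x)]
    hidden = [m * v for m, v in zip(mask, hidden)]
    return linear(w2, b2, hidden)

def prune(w1, b1, w2, mask):
    # Drop masked units: rows of the first layer, columns of the second.
    keep = [i for i, m in enumerate(mask) if m == 1.0]
    w1_small = [w1[i] for i in keep]
    b1_small = [b1[i] for i in keep]
    w2_small = [[row[i] for i in keep] for row in w2]
    return w1_small, b1_small, w2_small

# Toy frozen weights (hypothetical values).
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 2.0, 3.0]]
b2 = [0.5]
x = [1.0, 2.0]

mask_logits = [2.0, -3.0, 0.5]   # pretend these were learned
mask = hard_mask(mask_logits)    # second hidden unit is masked out
penalty = sparsity_penalty(mask_logits)

full_out = masked_forward(w1, b1, w2, b2, mask, x)

w1_small, b1_small, w2_small = prune(w1, b1, w2, mask)
ones = [1.0] * len(w1_small)
small_out = masked_forward(w1_small, b1_small, w2_small, b2, ones, x)
```

Here `small_out` matches `full_out` although the pruned network has two hidden units instead of three, which is where the compute savings would come from once masked units are removed.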
Related papers
- Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC [77.8851460746251]
We propose a simple, efficient, and general approach to fine-tune diffusion models.
ONE-PIC enhances the inherited generative ability in the pretrained diffusion models without introducing additional modules.
Our method is simple and efficient which streamlines the adaptation process and achieves excellent performance with lower costs.
arXiv Detail & Related papers (2024-12-07T11:19:32Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding [43.68557263195205]
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.
We propose three task-specific structured pruning methods to deal with such heterogeneous networks.
Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and is able to reduce the computation by 40% to 50% without any degradation.
arXiv Detail & Related papers (2023-02-27T20:39:54Z) - Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers [6.002503434201551]
We study the use of audio transformers trained on large-scale datasets to learn general-purpose representations.
Our results show that representations extracted by audio transformers outperform CNN representations.
arXiv Detail & Related papers (2022-11-25T08:39:12Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations by more than half for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Backbones-Review: Feature Extraction Networks for Deep Learning and Deep Reinforcement Learning Approaches [3.255610188565679]
CNNs make it possible to work with large-scale data and to cover different scenarios for a specific task.
Many networks have been proposed and have become well-known backbones used in DL models across AI tasks.
A backbone is a known network that has previously been trained on many other tasks and has demonstrated its effectiveness.
arXiv Detail & Related papers (2022-06-16T09:18:34Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing, this is done in a completely label-free manner.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z) - Diet deep generative audio models with structured lottery [2.348805691644086]
We study the lottery ticket hypothesis on deep generative audio models.
We show that we can remove up to 95% of the model weights without significant degradation in accuracy.
We discuss the possibility of implementing deep generative audio models on embedded platforms.
arXiv Detail & Related papers (2020-07-31T16:43:10Z)