LEAN: Light and Efficient Audio Classification Network
- URL: http://arxiv.org/abs/2305.12712v1
- Date: Mon, 22 May 2023 04:45:04 GMT
- Title: LEAN: Light and Efficient Audio Classification Network
- Authors: Shwetank Choudhary, CR Karthik, Punuru Sri Lakshmi and Sumit Kumar
- Abstract summary: We propose a lightweight on-device deep learning-based model for audio classification, LEAN.
LEAN consists of a raw waveform-based temporal feature extractor called the Wave Encoder and a logmel-based pretrained YAMNet.
We show that combining a trainable wave encoder and a pretrained YAMNet with cross attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
- Score: 1.5070398746522742
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the past few years, audio classification on large-scale datasets
such as AudioSet has been an important research area. Several deep
convolution-based neural networks have shown compelling performance, notably
VGGish, YAMNet, and the Pretrained Audio Neural Network (PANN). These models are
available as pretrained architectures for transfer learning as well as for
adaptation to specific audio tasks. In this paper, we propose LEAN, a lightweight
on-device deep learning-based model for audio classification. LEAN consists of a
raw waveform-based temporal feature extractor called the Wave Encoder and a
logmel-based pretrained YAMNet. We show that combining a trainable wave encoder
and a pretrained YAMNet with cross attention-based temporal realignment yields
competitive performance on downstream audio classification tasks with a smaller
memory footprint, making it suitable for resource-constrained devices such as
mobile and edge devices. Our proposed system achieves an on-device mean average
precision (mAP) of .445 with a memory footprint of a mere 4.5 MB on the FSD50K
dataset, a 22% improvement over the baseline on-device mAP on the same dataset.
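The paper does not ship code; the following is a minimal PyTorch sketch of the three-part design the abstract describes: a trainable wave encoder over raw audio, frozen frame embeddings standing in for pretrained YAMNet, and cross attention that realigns the two temporal streams. All layer sizes, the 1024-dim YAMNet frame assumption, and class names are illustrative, not the authors' implementation (FSD50K does have 200 classes).

```python
# Illustrative sketch only; hyperparameters and module structure are assumptions.
import torch
import torch.nn as nn

class WaveEncoder(nn.Module):
    """Trainable temporal feature extractor over raw waveforms."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=1024, stride=512),  # frame the waveform
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
    def forward(self, wav):              # wav: (batch, samples)
        x = self.conv(wav.unsqueeze(1))  # (batch, dim, frames)
        return x.transpose(1, 2)         # (batch, frames, dim)

class LEANSketch(nn.Module):
    def __init__(self, yamnet_dim=1024, dim=128, n_classes=200):
        super().__init__()
        self.wave_enc = WaveEncoder(dim)
        self.proj = nn.Linear(yamnet_dim, dim)  # project frozen YAMNet frames
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)
    def forward(self, wav, yamnet_frames):
        # yamnet_frames: (batch, T_y, yamnet_dim), produced by a frozen YAMNet
        q = self.wave_enc(wav)                 # (batch, T_w, dim)
        kv = self.proj(yamnet_frames)          # (batch, T_y, dim)
        fused, _ = self.cross_attn(q, kv, kv)  # cross attention-based realignment
        return self.head(fused.mean(dim=1))    # clip-level logits

logits = LEANSketch()(torch.randn(2, 16000), torch.randn(2, 10, 1024))
print(logits.shape)  # torch.Size([2, 200])
```

Mean-pooling the realigned frames into a clip-level prediction is one plausible reading; the paper may aggregate differently.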
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks [20.931028377435034]
We show how to reduce the computational complexity and memory requirement of the PANNs model.
The code for the E-PANNs model has been released under an open source license.
arXiv Detail & Related papers (2023-05-30T00:08:55Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
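As a toy illustration of the quantized-latent idea (not the paper's codec, which uses residual vector quantization, multiple codebooks, and adversarial training), the sketch below snaps each latent frame to its nearest entry in a single learned codebook; the indices are what a codec would transmit. All dimensions are invented.

```python
# Toy quantized-latent autoencoder; every hyperparameter here is made up.
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    def __init__(self, dim=64, codebook_size=256):
        super().__init__()
        self.enc = nn.Conv1d(1, dim, kernel_size=8, stride=4, padding=2)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.dec = nn.ConvTranspose1d(dim, 1, kernel_size=8, stride=4, padding=2)

    def forward(self, wav):                             # wav: (batch, samples)
        z = self.enc(wav.unsqueeze(1)).transpose(1, 2)  # (batch, T, dim)
        # nearest codebook entry per latent frame; the indices are the bitstream
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (batch, T, K)
        idx = dist.argmin(dim=-1)                       # (batch, T)
        q = self.codebook(idx)                          # (batch, T, dim)
        q = z + (q - z).detach()                        # straight-through estimator
        return self.dec(q.transpose(1, 2)).squeeze(1), idx

recon, codes = ToyCodec()(torch.randn(1, 16000))  # recon: (1, 16000)
```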
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating point operations of off-the-shelf audio neural networks by more than half.
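Based only on this summary (not the SimPF authors' code), a non-parametric pooling front-end can be as small as the sketch below; halving the number of frames roughly halves the FLOPs of whatever backbone consumes them.

```python
# Minimal pooling front-end sketch; shapes and the pooling factor are assumptions.
import torch
import torch.nn.functional as F

def pooling_frontend(logmel, factor=2, mode="avg"):
    """logmel: (batch, frames, mels). Pools the time axis by `factor`."""
    x = logmel.transpose(1, 2)  # (batch, mels, frames) for 1-D pooling
    pool = F.avg_pool1d if mode == "avg" else F.max_pool1d
    return pool(x, kernel_size=factor).transpose(1, 2)

feats = pooling_frontend(torch.randn(4, 1000, 64))  # -> (4, 500, 64)
```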
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
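For readers unfamiliar with the lottery ticket hypothesis, the schematic below shows iterative magnitude pruning with rewinding in that style; it is a generic paraphrase of "LTH-IF", not the authors' procedure, and the fine-tuning step between rounds is elided.

```python
# Generic iterative magnitude pruning with rewinding; not the paper's exact recipe.
import torch

def prune_smallest(weight, mask, fraction=0.2):
    """Zero out the smallest `fraction` of the still-active weights."""
    alive = weight[mask.bool()].abs()
    threshold = alive.kthvalue(max(1, int(fraction * alive.numel()))).values
    return mask * (weight.abs() > threshold).float()

w0 = torch.randn(256, 256)  # initial weights, kept for rewinding
w, mask = w0.clone(), torch.ones_like(w0)
for _ in range(3):          # each round: prune, rewind, then fine-tune w * mask
    mask = prune_smallest(w, mask, fraction=0.2)
    w = w0 * mask           # rewind surviving weights to their initial values
print(f"sparsity: {1 - mask.mean().item():.2f}")  # ~0.49 after three rounds
```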
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- DeepSpectrumLite: A Power-Efficient Transfer Learning Framework for Embedded Speech and Audio Processing from Decentralised Data [0.0]
We introduce DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition.
The framework creates and augments Mel-spectrogram plots on-the-fly from raw audio signals which are then used to finetune specific pre-trained CNNs.
The whole pipeline can be run in real-time with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade Motorola moto e7 plus smartphone.
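A rough PyTorch approximation of that pipeline (the framework itself is a separate open-source project, so this is not its API): compute a Mel spectrogram on the fly, tile it to three channels, and fine-tune an ImageNet-pretrained CNN such as DenseNet121. Sample rate, Mel bins, and the 10-class head are placeholders.

```python
# Sketch of an on-the-fly spectrogram + transfer-learning pipeline; assumptions only.
import torch
import torchaudio
import torchvision

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
cnn = torchvision.models.densenet121(weights="IMAGENET1K_V1")  # downloads weights
cnn.classifier = torch.nn.Linear(cnn.classifier.in_features, 10)  # new task head

wav = torch.randn(8, 16000)            # a batch of 1-second clips
spec = (mel(wav) + 1e-6).log()         # log-Mel, as is conventional
spec = spec.unsqueeze(1).repeat(1, 3, 1, 1)  # tile to 3 channels for the CNN
logits = cnn(spec)                     # (8, 10); fine-tune with a standard loss
```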
arXiv Detail & Related papers (2021-04-23T14:32:33Z)
- Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms [30.3491261167433]
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms.
The use of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification.
We attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models.
arXiv Detail & Related papers (2021-02-13T13:44:46Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches.
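The ideal ratio mask is a standard oracle; a minimal NumPy version is below (the textbook definition, not necessarily the paper's exact recipe). Given ground-truth source spectrograms, it bounds what any masking-based separator could achieve, so no network needs to be trained to score a dataset.

```python
# Standard ideal ratio mask (IRM); the example data here is random, for shape only.
import numpy as np

def ideal_ratio_mask(target_mag, other_mags, eps=1e-8):
    """Per-bin ratio of target energy to total energy across all sources."""
    total = target_mag**2 + sum(m**2 for m in other_mags)
    return target_mag**2 / (total + eps)

vocals = np.abs(np.random.randn(513, 100))  # |STFT| of the target stem
drums = np.abs(np.random.randn(513, 100))   # |STFT| of the other stems
mask = ideal_ratio_mask(vocals, [drums])    # apply to mixture STFT, then iSTFT
```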
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3M people with hearing loss who cannot perceive events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant to people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.