E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural
Networks
- URL: http://arxiv.org/abs/2305.18665v1
- Date: Tue, 30 May 2023 00:08:55 GMT
- Title: E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural
Networks
- Authors: Arshdeep Singh, Haohe Liu, Mark D. Plumbley
- Abstract summary: We show how to reduce the computational complexity and memory requirement of the PANNs model.
The code for the E-PANNs model has been released under an open source license.
- Score: 20.931028377435034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sounds carry an abundance of information about activities and events in our
everyday environment, such as traffic noise, road works, music, or people
talking. Recent machine learning methods, such as convolutional neural networks
(CNNs), have been shown to be able to automatically recognize sound activities,
a task known as audio tagging. One such method, pre-trained audio neural
networks (PANNs), provides a neural network which has been pre-trained on over
500 sound classes from the publicly available AudioSet dataset, and can be used
as a baseline or starting point for other tasks. However, the existing PANNs
model has a high computational complexity and large storage requirement. This
could limit the potential for deploying PANNs on resource-constrained devices,
such as on-the-edge sound sensors, and could lead to high energy consumption if
many such devices were deployed. In this paper, we reduce the computational
complexity and memory requirement of the PANNs model by taking a pruning
approach to eliminate redundant parameters from the PANNs model. The resulting
Efficient PANNs (E-PANNs) model, which requires 36% fewer computations and 70%
less memory, also slightly improves the sound recognition (audio tagging)
performance. The code for the E-PANNs model has been released under an open
source license.
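The pruning idea lends itself to a short illustration. Below is a minimal sketch of magnitude-based (L1-norm) filter pruning for a single convolutional layer; the actual E-PANNs procedure lives in the released code, and the ranking criterion, keep ratio, and layer shapes here are illustrative assumptions.

```python
# Minimal sketch of magnitude-based filter pruning for one CNN layer.
# NOT the exact E-PANNs procedure; criterion and shapes are illustrative.
import torch
import torch.nn as nn

def rank_filters_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    """Return filter indices sorted by ascending L1 norm (least important first)."""
    # weight shape: (out_channels, in_channels, kH, kW)
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Build a smaller Conv2d that keeps only the highest-L1 filters."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = rank_filters_by_l1(conv)[-n_keep:]          # indices of kept filters
    pruned = nn.Conv2d(conv.in_channels, n_keep,
                       kernel_size=conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)    # illustrative layer
smaller = prune_conv_filters(conv, keep_ratio=0.5)
print(smaller)                                         # Conv2d with 64 filters
```

Note that removing output filters from one layer also shrinks the expected input channels of the next layer, so pruning a full network requires propagating these index sets forward.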
Related papers
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis of DAISY's adaptivity shows that the model exits early (using fewer layers) on clean data and exits late (using more layers) on noisy data.
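As a rough illustration of loss-based early exit, the sketch below runs encoder layers one at a time and stops once a per-layer score (standing in for DAISY's self-supervised loss estimate) drops below a threshold; the real DAISY criterion, exit heads, and training are not reproduced here.

```python
# Hedged early-exit sketch: per-layer "exit scores" stand in for DAISY's
# self-supervised loss estimates. Threshold and shapes are illustrative.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4,
                                                   batch_first=True)
                        for _ in range(6)])
exit_heads = nn.ModuleList([nn.Linear(64, 1) for _ in range(6)])  # loss predictors

def forward_with_early_exit(x, threshold=0.5):
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        # Predicted loss for this depth (assumed trained in the real system).
        est_loss = head(x.mean(dim=1)).sigmoid().mean()
        if est_loss < threshold:          # clean input -> exit early
            break
    return x

out = forward_with_early_exit(torch.randn(2, 50, 64))  # (batch, time, dim)
print(out.shape)
```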
arXiv Detail & Related papers (2024-06-08T12:58:13Z)
- Exploring Green AI for Audio Deepfake Detection [21.17957700009653]
State-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance.
Deep NLP models produce around 626k lbs of CO2, equivalent to five times the lifetime emissions of an average US car.
This study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources.
arXiv Detail & Related papers (2024-03-21T10:54:21Z)
- sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
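SincNet's front-end is easy to sketch: each filter is a band-pass kernel parameterized only by its two cutoff frequencies. The NumPy example below builds one such kernel; the cutoffs, length, and windowing are illustrative, not sVAD's trained values.

```python
# Hedged sketch of a SincNet-style band-pass filter: the difference of two
# windowed sinc low-pass kernels. Cutoffs and length are made-up examples.
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, sr=16000, length=101):
    """Band-pass FIR kernel parameterized by two cutoff frequencies."""
    t = np.arange(-(length // 2), length // 2 + 1) / sr
    def lowpass(fc):
        return 2 * fc * np.sinc(2 * fc * t)          # np.sinc is sin(pi x)/(pi x)
    kernel = lowpass(f2_hz) - lowpass(f1_hz)
    kernel *= np.hamming(length)                     # reduce spectral leakage
    return kernel / np.abs(kernel).sum()

band = sinc_bandpass(300, 3000)                      # speech-ish band (illustrative)
audio = np.random.randn(16000)                       # 1 s of noise as stand-in input
filtered = np.convolve(audio, band, mode="same")
print(filtered.shape)
```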
arXiv Detail & Related papers (2024-03-09T02:55:44Z)
- LEAN: Light and Efficient Audio Classification Network [1.5070398746522742]
We propose a lightweight on-device deep learning-based model for audio classification, LEAN.
LEAN consists of a raw waveform-based temporal feature extractor called Wave realignment and a log-mel-based pretrained YAMNet.
We show that combining a trainable wave encoder and the pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
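A hedged sketch of the cross-attention idea: let frames from a trainable wave encoder attend over pretrained YAMNet frames to realign the two timelines. The dimensions and residual fusion below are assumptions, not LEAN's actual architecture.

```python
# Hedged sketch of cross-attention fusion between two feature streams,
# in the spirit of LEAN's temporal realignment. Shapes are illustrative.
import torch
import torch.nn as nn

d = 128
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

wave_feats = torch.randn(2, 100, d)    # queries: trainable wave-encoder frames
yamnet_feats = torch.randn(2, 48, d)   # keys/values: pretrained YAMNet frames

# Each wave frame attends to (re-aligns against) the YAMNet timeline.
aligned, _ = attn(query=wave_feats, key=yamnet_feats, value=yamnet_feats)
fused = wave_feats + aligned           # residual fusion before the classifier
print(fused.shape)                     # torch.Size([2, 100, 128])
```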
arXiv Detail & Related papers (2023-05-22T04:45:04Z)
- BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition [0.0]
Recent end-to-end deep learning models have been shown to match or exceed the performance of state-of-the-art recurrent neural networks (RNNs) on automatic speech recognition tasks.
We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.
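The weight-variance idea can be shown with a small variational linear layer using the reparameterization trick, where each weight is a Gaussian with a learned mean and standard deviation. BayesSpeech applies this inside a Transformer; the standalone layer below is illustrative only.

```python
# Hedged sketch of a variational (Bayesian) linear layer: every forward pass
# samples new weights via the reparameterization trick. Illustrative only.
import torch
import torch.nn as nn

class BayesLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(d_out, d_in) * 0.05)
        self.rho = nn.Parameter(torch.full((d_out, d_in), -4.0))  # softplus -> sigma

    def forward(self, x):
        sigma = torch.nn.functional.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps          # w ~ N(mu, sigma^2), differentiable
        return x @ w.T

layer = BayesLinear(80, 256)
out = layer(torch.randn(4, 80))
print(out.shape)   # two calls give different outputs: weights are sampled
```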
arXiv Detail & Related papers (2023-01-16T16:19:04Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) that use non-parametric pooling operations to reduce redundant information.
SimPFs can reduce the number of floating-point operations by more than half for off-the-shelf audio neural networks.
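Because SimPFs are non-parametric, a minimal sketch is just a pooling layer applied to the input spectrogram before the backbone; the pooling type and size below are one of several variants the paper studies.

```python
# Hedged sketch of a SimPF-style front-end: halve the number of spectrogram
# time frames with average pooling, roughly halving the backbone's FLOPs.
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=(2, 1))        # pool over time, keep mel bins

mel = torch.randn(8, 1, 1000, 64)              # (batch, chan, frames, mels)
reduced = pool(mel)                            # -> (8, 1, 500, 64)
print(reduced.shape)
# Any off-the-shelf audio CNN can now run on `reduced` at roughly half the cost.
```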
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
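The general LTH-style recipe (prune a fraction of weights by magnitude, fine-tune, repeat) can be sketched with PyTorch's pruning utilities; the actual LTH-IF schedule, rewinding details, and audio-visual model are not reproduced here, and `train_one_epoch` is a hypothetical stand-in.

```python
# Hedged sketch of iterative magnitude pruning with fine-tuning rounds.
# `train_one_epoch` is a hypothetical placeholder, not the paper's trainer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

def train_one_epoch(model):
    pass  # placeholder: fine-tune surviving weights on task data

for round_idx in range(3):                      # 3 prune/fine-tune rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Remove 20% of the remaining weights by magnitude each round.
            prune.l1_unstructured(module, name="weight", amount=0.2)
    train_one_epoch(model)

kept = sum(int(m.weight_mask.sum()) for m in model.modules()
           if isinstance(m, nn.Linear))
print("surviving weights:", kept)
```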
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Neural Architecture Search for Energy Efficient Always-on Audio Models [1.3846912186423144]
We present several changes to neural architecture search (NAS) that improve the chance of success in practical situations.
We benchmark the performance of our search on real hardware, but since running thousands of tests on real hardware is difficult, we use a random forest model to roughly predict the energy usage of a candidate network.
Our search, evaluated on a sound-event classification dataset based upon AudioSet, results in an order of magnitude less energy per inference and a much smaller memory footprint.
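A sketch of the random-forest energy proxy: featurize each candidate architecture, fit on a handful of measured networks, and predict energy for the rest. The features and measurements below are synthetic placeholders, not the paper's data.

```python
# Hedged sketch of a random-forest energy proxy for NAS candidates.
# Features and "measurements" are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Illustrative architecture features: [depth, width, total MACs (millions)]
X_measured = rng.uniform([2, 16, 1], [10, 256, 50], size=(40, 3))
# Fake "measured" energy in mJ, loosely tied to MACs for the demo.
y_measured = 0.3 * X_measured[:, 2] + rng.normal(0, 0.5, 40)

proxy = RandomForestRegressor(n_estimators=100, random_state=0)
proxy.fit(X_measured, y_measured)

candidate = np.array([[4, 64, 12.0]])          # a new architecture to score
print("predicted energy (mJ):", proxy.predict(candidate)[0])
```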
arXiv Detail & Related papers (2022-02-09T06:10:18Z)
- Event Based Time-Vectors for auditory features extraction: a neuromorphic approach for low power audio recognition [4.206844212918807]
We present a neuromorphic architecture capable of unsupervised auditory feature recognition.
We then validate the network on a subset of Google's Speech Commands dataset.
arXiv Detail & Related papers (2021-12-13T21:08:04Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models against the roofline model and a refined roofline model.
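A rough sketch of layer-wise latency estimation in this spirit: per-layer-type models fitted from micro-kernel benchmarks are summed over a network's layers. The linear per-type models below are placeholders; ANNETTE's stacked and mixed models are more elaborate.

```python
# Hedged sketch of layer-wise latency estimation: sum per-layer-type models
# (assumed fitted from micro-kernel benchmarks) over a network's layers.
from typing import Callable

# Hypothetical per-layer-type estimators: map layer MACs (millions) to ms.
LayerModel = Callable[[float], float]
layer_models: dict[str, LayerModel] = {
    "conv": lambda macs: 0.05 + 0.02 * macs,   # placeholder conv model
    "fc":   lambda macs: 0.02 + 0.01 * macs,   # placeholder fully-connected model
}

def estimate_network_ms(layers: list[tuple[str, float]]) -> float:
    """Sum per-layer estimates: [(layer_type, macs_in_millions), ...]."""
    return sum(layer_models[kind](macs) for kind, macs in layers)

net = [("conv", 30.0), ("conv", 55.0), ("fc", 4.0)]   # illustrative network
print(f"estimated latency: {estimate_network_ms(net):.2f} ms")
```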
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach to reduce search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)