E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural
Networks
- URL: http://arxiv.org/abs/2305.18665v1
- Date: Tue, 30 May 2023 00:08:55 GMT
- Title: E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural
Networks
- Authors: Arshdeep Singh, Haohe Liu, Mark D. Plumbley
- Abstract summary: We show how to reduce the computational complexity and memory requirement of the PANNs model.
The code for the E-PANNs model has been released under an open source license.
- Score: 20.931028377435034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sounds carry an abundance of information about activities and events in our
everyday environment, such as traffic noise, road works, music, or people
talking. Recent machine learning methods, such as convolutional neural networks
(CNNs), have been shown to be able to automatically recognize sound activities,
a task known as audio tagging. One such method, pre-trained audio neural
networks (PANNs), provides a neural network which has been pre-trained on over
500 sound classes from the publicly available AudioSet dataset, and can be used
as a baseline or starting point for other tasks. However, the existing PANNs
model has a high computational complexity and large storage requirement. This
could limit the potential for deploying PANNs on resource-constrained devices,
such as on-the-edge sound sensors, and could lead to high energy consumption if
many such devices were deployed. In this paper, we reduce the computational
complexity and memory requirement of the PANNs model by taking a pruning
approach to eliminate redundant parameters from the PANNs model. The resulting
Efficient PANNs (E-PANNs) model, which requires 36% fewer computations and 70%
less memory, also slightly improves the sound recognition (audio tagging)
performance. The code for the E-PANNs model has been released under an open
source license.
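The pruning idea lends itself to a short illustration. Below is a minimal sketch of magnitude-based (L1-norm) filter pruning for a single convolutional layer; the actual E-PANNs procedure lives in the released code, and the ranking criterion, keep ratio, and layer shapes here are illustrative assumptions.

```python
# Minimal sketch of magnitude-based filter pruning for one CNN layer.
# NOT the exact E-PANNs procedure; criterion and shapes are illustrative.
import torch
import torch.nn as nn

def rank_filters_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    """Return filter indices sorted by ascending L1 norm (least important first)."""
    # weight shape: (out_channels, in_channels, kH, kW)
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    return torch.argsort(scores)

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Build a smaller Conv2d that keeps only the highest-L1 filters."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = rank_filters_by_l1(conv)[-n_keep:]          # indices of kept filters
    pruned = nn.Conv2d(conv.in_channels, n_keep,
                       kernel_size=conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)    # illustrative layer
smaller = prune_conv_filters(conv, keep_ratio=0.5)
print(smaller)                                         # Conv2d with 64 filters
```

Note that removing output filters from one layer also shrinks the expected input channels of the next layer, so pruning a full network requires propagating these index sets forward.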
Related papers
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis of DAISY's adaptivity shows that the model exits early (using fewer layers) on clean data and exits late (using more layers) on noisy data.
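As a rough illustration of loss-based early exit, the sketch below runs encoder layers one at a time and stops once a per-layer score (standing in for DAISY's self-supervised loss estimate) drops below a threshold; the real DAISY criterion, exit heads, and training are not reproduced here.

```python
# Hedged early-exit sketch: per-layer "exit scores" stand in for DAISY's
# self-supervised loss estimates. Threshold and shapes are illustrative.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4,
                                                   batch_first=True)
                        for _ in range(6)])
exit_heads = nn.ModuleList([nn.Linear(64, 1) for _ in range(6)])  # loss predictors

def forward_with_early_exit(x, threshold=0.5):
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        # Predicted loss for this depth (assumed trained in the real system).
        est_loss = head(x.mean(dim=1)).sigmoid().mean()
        if est_loss < threshold:          # clean input -> exit early
            break
    return x

out = forward_with_early_exit(torch.randn(2, 50, 64))  # (batch, time, dim)
print(out.shape)
```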
arXiv Detail & Related papers (2024-06-08T12:58:13Z)
- Exploring Green AI for Audio Deepfake Detection [21.17957700009653]
State-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance.
Deep NLP models produce around 626k lbs of CO2, equivalent to five times the lifetime emissions of an average US car.
This study presents a novel framework for audio deepfake detection that can be seamlessly trained using standard CPU resources.
arXiv Detail & Related papers (2024-03-21T10:54:21Z)
- sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms.
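SincNet's front-end is easy to sketch: each filter is a band-pass kernel parameterized only by its two cutoff frequencies. The NumPy example below builds one such kernel; the cutoffs, length, and windowing are illustrative, not sVAD's trained values.

```python
# Hedged sketch of a SincNet-style band-pass filter: the difference of two
# windowed sinc low-pass kernels. Cutoffs and length are made-up examples.
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, sr=16000, length=101):
    """Band-pass FIR kernel parameterized by two cutoff frequencies."""
    t = np.arange(-(length // 2), length // 2 + 1) / sr
    def lowpass(fc):
        return 2 * fc * np.sinc(2 * fc * t)          # np.sinc is sin(pi x)/(pi x)
    kernel = lowpass(f2_hz) - lowpass(f1_hz)
    kernel *= np.hamming(length)                     # reduce spectral leakage
    return kernel / np.abs(kernel).sum()

band = sinc_bandpass(300, 3000)                      # speech-ish band (illustrative)
audio = np.random.randn(16000)                       # 1 s of noise as stand-in input
filtered = np.convolve(audio, band, mode="same")
print(filtered.shape)
```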
arXiv Detail & Related papers (2024-03-09T02:55:44Z)
- LEAN: Light and Efficient Audio Classification Network [1.5070398746522742]
We propose a lightweight on-device deep learning-based model for audio classification, LEAN.
LEAN consists of a raw waveform-based temporal feature extractor called Wave realignment and a log-mel-based pretrained YAMNet.
We show that combining a trainable wave encoder and the pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
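A hedged sketch of the cross-attention idea: let frames from a trainable wave encoder attend over pretrained YAMNet frames to realign the two timelines. The dimensions and residual fusion below are assumptions, not LEAN's actual architecture.

```python
# Hedged sketch of cross-attention fusion between two feature streams,
# in the spirit of LEAN's temporal realignment. Shapes are illustrative.
import torch
import torch.nn as nn

d = 128
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

wave_feats = torch.randn(2, 100, d)    # queries: trainable wave-encoder frames
yamnet_feats = torch.randn(2, 48, d)   # keys/values: pretrained YAMNet frames

# Each wave frame attends to (re-aligns against) the YAMNet timeline.
aligned, _ = attn(query=wave_feats, key=yamnet_feats, value=yamnet_feats)
fused = wave_feats + aligned           # residual fusion before the classifier
print(fused.shape)                     # torch.Size([2, 100, 128])
```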
arXiv Detail & Related papers (2023-05-22T04:45:04Z)
- BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition [0.0]
Recent end-to-end deep learning models have been shown to match or exceed the performance of state-of-the-art recurrent neural networks (RNNs) on automatic speech recognition tasks.
We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.
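The weight-variance idea can be shown with a small variational linear layer using the reparameterization trick, where each weight is a Gaussian with a learned mean and standard deviation. BayesSpeech applies this inside a Transformer; the standalone layer below is illustrative only.

```python
# Hedged sketch of a variational (Bayesian) linear layer: every forward pass
# samples new weights via the reparameterization trick. Illustrative only.
import torch
import torch.nn as nn

class BayesLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(d_out, d_in) * 0.05)
        self.rho = nn.Parameter(torch.full((d_out, d_in), -4.0))  # softplus -> sigma

    def forward(self, x):
        sigma = torch.nn.functional.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps          # w ~ N(mu, sigma^2), differentiable
        return x @ w.T

layer = BayesLinear(80, 256)
out = layer(torch.randn(4, 80))
print(out.shape)   # two calls give different outputs: weights are sampled
```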
arXiv Detail & Related papers (2023-01-16T16:19:04Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) that use non-parametric pooling operations to reduce redundant information.
SimPFs can reduce the number of floating-point operations by more than half for off-the-shelf audio neural networks.
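Because SimPFs are non-parametric, a minimal sketch is just a pooling layer applied to the input spectrogram before the backbone; the pooling type and size below are one of several variants the paper studies.

```python
# Hedged sketch of a SimPF-style front-end: halve the number of spectrogram
# time frames with average pooling, roughly halving the backbone's FLOPs.
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=(2, 1))        # pool over time, keep mel bins

mel = torch.randn(8, 1, 1000, 64)              # (batch, chan, frames, mels)
reduced = pool(mel)                            # -> (8, 1, 500, 64)
print(reduced.shape)
# Any off-the-shelf audio CNN can now run on `reduced` at roughly half the cost.
```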
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
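The general LTH-style recipe (prune a fraction of weights by magnitude, fine-tune, repeat) can be sketched with PyTorch's pruning utilities; the actual LTH-IF schedule, rewinding details, and audio-visual model are not reproduced here, and `train_one_epoch` is a hypothetical stand-in.

```python
# Hedged sketch of iterative magnitude pruning with fine-tuning rounds.
# `train_one_epoch` is a hypothetical placeholder, not the paper's trainer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

def train_one_epoch(model):
    pass  # placeholder: fine-tune surviving weights on task data

for round_idx in range(3):                      # 3 prune/fine-tune rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Remove 20% of the remaining weights by magnitude each round.
            prune.l1_unstructured(module, name="weight", amount=0.2)
    train_one_epoch(model)

kept = sum(int(m.weight_mask.sum()) for m in model.modules()
           if isinstance(m, nn.Linear))
print("surviving weights:", kept)
```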
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Neural Architecture Search for Energy Efficient Always-on Audio Models [1.3846912186423144]
We present several changes to neural architecture search (NAS) that improve the chance of success in practical situations.
We benchmark the performance of our search on real hardware, but since running thousands of tests on real hardware is difficult, we use a random forest model to roughly predict the energy usage of a candidate network.
Our search, evaluated on a sound-event classification dataset based upon AudioSet, results in an order of magnitude less energy per inference and a much smaller memory footprint.
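A sketch of the random-forest energy proxy: featurize each candidate architecture, fit on a handful of measured networks, and predict energy for the rest. The features and measurements below are synthetic placeholders, not the paper's data.

```python
# Hedged sketch of a random-forest energy proxy for NAS candidates.
# Features and "measurements" are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Illustrative architecture features: [depth, width, total MACs (millions)]
X_measured = rng.uniform([2, 16, 1], [10, 256, 50], size=(40, 3))
# Fake "measured" energy in mJ, loosely tied to MACs for the demo.
y_measured = 0.3 * X_measured[:, 2] + rng.normal(0, 0.5, 40)

proxy = RandomForestRegressor(n_estimators=100, random_state=0)
proxy.fit(X_measured, y_measured)

candidate = np.array([[4, 64, 12.0]])          # a new architecture to score
print("predicted energy (mJ):", proxy.predict(candidate)[0])
```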
arXiv Detail & Related papers (2022-02-09T06:10:18Z)
- Event Based Time-Vectors for auditory features extraction: a neuromorphic approach for low power audio recognition [4.206844212918807]
We present a neuromorphic architecture capable of unsupervised auditory feature recognition.
We then validate the network on a subset of Google's Speech Commands dataset.
arXiv Detail & Related papers (2021-12-13T21:08:04Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models against the roofline model and a refined roofline model.
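A rough sketch of layer-wise latency estimation in this spirit: per-layer-type models fitted from micro-kernel benchmarks are summed over a network's layers. The linear per-type models below are placeholders; ANNETTE's stacked and mixed models are more elaborate.

```python
# Hedged sketch of layer-wise latency estimation: sum per-layer-type models
# (assumed fitted from micro-kernel benchmarks) over a network's layers.
from typing import Callable

# Hypothetical per-layer-type estimators: map layer MACs (millions) to ms.
LayerModel = Callable[[float], float]
layer_models: dict[str, LayerModel] = {
    "conv": lambda macs: 0.05 + 0.02 * macs,   # placeholder conv model
    "fc":   lambda macs: 0.02 + 0.01 * macs,   # placeholder fully-connected model
}

def estimate_network_ms(layers: list[tuple[str, float]]) -> float:
    """Sum per-layer estimates: [(layer_type, macs_in_millions), ...]."""
    return sum(layer_models[kind](macs) for kind, macs in layers)

net = [("conv", 30.0), ("conv", 55.0), ("fc", 4.0)]   # illustrative network
print(f"estimated latency: {estimate_network_ms(net):.2f} ms")
```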
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach to reduce search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)