Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge
Distillation
- URL: http://arxiv.org/abs/2211.04772v3
- Date: Fri, 23 Jun 2023 07:21:57 GMT
- Title: Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge
Distillation
- Authors: Florian Schmid, Khaled Koutini and Gerhard Widmer
- Abstract summary: We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet.
- Score: 6.617487928813374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging,
outrunning previously dominating Convolutional Neural Networks (CNNs). Their
superiority is based on the ability to scale up and exploit large-scale
datasets such as AudioSet. However, Transformers are demanding in terms of
model size and computational requirements compared to CNNs. We propose a
training procedure for efficient CNNs based on offline Knowledge Distillation
(KD) from high-performing yet complex transformers. The proposed training
schema and the efficient CNN design based on MobileNetV3 result in models
outperforming previous solutions in terms of parameter and computational
efficiency and prediction performance. We provide models of different
complexity levels, scaling from low-complexity models up to a new
state-of-the-art performance of .483 mAP on AudioSet. Source Code available at:
https://github.com/fschmid56/EfficientAT
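To make the training objective concrete, below is a minimal PyTorch sketch of an offline KD loss for multi-label tagging: the student is trained against both the ground-truth labels and teacher probabilities derived from precomputed logits. The weighting `lam` and the use of BCE for the distillation term are illustrative assumptions; the authors' exact configuration is in the linked repository.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, labels, teacher_logits, lam=0.1):
    """Offline KD objective for multi-label audio tagging (illustrative).

    teacher_logits are assumed to be precomputed and stored with the
    dataset, so the transformer teacher is never run during training.
    """
    # Hard-label term: standard multi-label BCE against the ground truth.
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Soft-label term: BCE against the teacher's probabilities.
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))
    # lam balances label loss vs. distillation loss (value is illustrative).
    return lam * hard + (1.0 - lam) * soft
```

Because the teacher predictions are computed once and stored ("offline" KD), the large transformer never has to be kept in memory during student training.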
Related papers
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models [4.803510486360358]
Current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs.
We introduce dynamic CNN blocks constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms.
Our experiments indicate that the introduced dynamic CNNs achieve better performance on downstream tasks and scale up well.
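As a rough illustration of one ingredient, a CondConv-style dynamic convolution can be sketched as follows: a lightweight attention branch predicts per-example mixing weights over K kernel experts. This is a generic sketch, not the paper's exact block, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """CondConv-style dynamic convolution (illustrative): K kernel
    'experts' are mixed per example by input-dependent attention."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        self.attn = nn.Linear(in_ch, num_kernels)  # kernel attention
        self.pad = k // 2

    def forward(self, x):                    # x: (B, C, F, T) spectrogram
        alpha = self.attn(x.mean(dim=(2, 3))).softmax(dim=1)  # (B, K)
        w = torch.einsum('bk,koihw->boihw', alpha, self.weight)
        b, o, i, kh, kw = w.shape
        # Apply one mixed kernel per example via a grouped convolution.
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]),
                       w.reshape(-1, i, kh, kw),
                       padding=self.pad, groups=b)
        return out.reshape(b, o, *out.shape[2:])
```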
arXiv Detail & Related papers (2023-10-24T09:08:20Z)
- A Lightweight CNN-Transformer Model for Learning Traveling Salesman Problems [0.0]
The CNN-Transformer model is able to better learn spatial features from input data using a CNN embedding layer.
The proposed model exhibits the best performance on real-world datasets.
arXiv Detail & Related papers (2023-05-03T04:28:10Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
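A minimal sketch of such an adapter, assuming a residual bottleneck built from 1-D convolutions and inserted after a frozen feature-extractor layer; the hidden size, kernel width, and placement are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class CNNAdapter(nn.Module):
    """Residual bottleneck adapter built from 1-D convolutions
    (illustrative). In adapter tuning, the pretrained SSL model stays
    frozen and only small modules like this one are updated."""
    def __init__(self, channels, hidden=64, k=3):
        super().__init__()
        self.down = nn.Conv1d(channels, hidden, kernel_size=1)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=k, padding=k // 2)
        self.up = nn.Conv1d(hidden, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):           # x: (B, C, T) feature-extractor output
        return x + self.up(self.act(self.conv(self.down(x))))
```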
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers [6.002503434201551]
We study the use of audio transformers trained on large-scale datasets to learn general-purpose representations.
Our results show that representations extracted by audio transformers outperform CNN representations.
arXiv Detail & Related papers (2022-11-25T08:39:12Z)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions [95.94629864981091]
This work presents a new large-scale CNN-based foundation model, termed InternImage, which can benefit from increasing parameters and training data in the way ViTs do.
The proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive data with large-scale parameters, as ViTs do.
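The underlying operator can be illustrated with torchvision's basic deformable convolution, in which a side branch predicts per-position sampling offsets so the receptive field is no longer a fixed grid. InternImage's DCNv3 adds further machinery (grouping, modulation, softmax normalization), so this is only a sketch of the core idea.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    """Basic deformable convolution: a side branch predicts (dy, dx)
    sampling offsets for every kernel tap at every output position."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        # Zero-init offsets so the layer starts as a standard convolution.
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)
        self.pad = k // 2

    def forward(self, x):                      # x: (B, C, H, W)
        return deform_conv2d(x, self.offset(x), self.weight,
                             padding=self.pad)
```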
arXiv Detail & Related papers (2022-11-10T18:59:04Z)
- Efficient Training of Audio Transformers with Patchout [7.073210405344709]
We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU.
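A minimal sketch of the unstructured variant of the idea: randomly dropping a fraction of patch tokens during training shortens the token sequence, cutting self-attention cost roughly quadratically while also acting as a regularizer. The paper additionally applies structured patchout over whole time and frequency stripes; the function below is illustrative.

```python
import torch

def patchout(tokens, drop_ratio=0.5, training=True):
    """Drop a random fraction of patch tokens during training
    (unstructured variant, illustrative). tokens: (B, N, D) patch
    embeddings with positional encodings already added; any class
    token should be appended after this step."""
    if not training or drop_ratio == 0.0:
        return tokens
    b, n, d = tokens.shape
    keep = max(1, int(n * (1.0 - drop_ratio)))
    # Random permutation per example; keep the first `keep` indices.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
```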
arXiv Detail & Related papers (2021-10-11T08:07:50Z)
- Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present Container (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient network, named Container-Light, can be employed in object detection and instance segmentation networks.
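The unifying idea can be sketched loosely: convolutions, self-attention, and MLPs all aggregate context through an affinity matrix that is either static (input-independent) or dynamic (input-dependent), and a block can learn to mix the two. The single-head module below is an illustrative reconstruction, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    """Single-head context aggregation (illustrative): mixes a dynamic,
    attention-like affinity with a static, convolution-like affinity
    using a learnable gate alpha."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.static_affinity = nn.Parameter(torch.zeros(seq_len, seq_len))
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dyn = (q @ k.transpose(-2, -1)) / (x.shape[-1] ** 0.5)
        affinity = (self.alpha * dyn.softmax(dim=-1)
                    + (1 - self.alpha) * self.static_affinity.softmax(dim=-1))
        return self.proj(affinity @ v)
```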
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good quality performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
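A compact sketch of a Conformer block, showing the characteristic ordering of half-step feed-forward, self-attention, and convolution modules; relative positional encoding and dropout from the paper are omitted and the sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Conformer conv module: pointwise+GLU -> depthwise -> pointwise."""
    def __init__(self, dim, k=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                      # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)       # (B, D, T) for Conv1d
        y = F.glu(self.pw1(y), dim=1)
        y = self.pw2(F.silu(self.bn(self.dw(y))))
        return x + y.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, dim, heads=4):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.LayerNorm(dim),
                                    nn.Linear(dim, 4 * dim), nn.SiLU(),
                                    nn.Linear(4 * dim, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, D)
        x = x + 0.5 * self.ff1(x)              # macaron half-step FFN
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                       # residual inside ConvModule
        x = x + 0.5 * self.ff2(x)              # second half-step FFN
        return self.out_norm(x)
```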
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.