CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
- URL: http://arxiv.org/abs/2203.06760v1
- Date: Sun, 13 Mar 2022 21:14:04 GMT
- Title: CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
- Authors: Yuan Gong, Sameer Khurana, Andrew Rouditchenko, and James Glass
- Abstract summary: Convolutional neural networks (CNNs) have been the de facto standard building block for end-to-end audio classification models.
Recently, neural networks based solely on self-attention mechanisms such as the Audio Spectrogram Transformer (AST) have been shown to outperform CNNs.
- Score: 11.505633449307684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio classification is an active research area with a wide range of
applications. Over the past decade, convolutional neural networks (CNNs) have
been the de facto standard building block for end-to-end audio classification
models. Recently, neural networks based solely on self-attention mechanisms
such as the Audio Spectrogram Transformer (AST) have been shown to outperform
CNNs. In this paper, we find an intriguing interaction between the two very
different models: CNN and AST models are good teachers for each other. When we
use either of them as the teacher and train the other model as the student via
knowledge distillation (KD), the performance of the student model noticeably
improves, and in many cases, is better than the teacher model. In our
experiments with this CNN/Transformer Cross-Model Knowledge Distillation (CMKD)
method, we achieve new state-of-the-art performance on FSD50K, AudioSet, and
ESC-50.
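The recipe the abstract describes is standard knowledge distillation applied across the two architectures. As a rough illustration, here is a minimal PyTorch sketch of such a cross-model KD objective; the temperature, the loss weight alpha, and the use of sigmoids (for multi-label tagging sets such as FSD50K and AudioSet) are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn.functional as F

def cmkd_loss(student_logits, teacher_logits, targets, temperature=2.5, alpha=0.5):
    # Supervised term: ordinary label loss (binary CE for multi-label tagging).
    label_loss = F.binary_cross_entropy_with_logits(student_logits, targets)
    # Distillation term: match the teacher's temperature-softened outputs.
    soft_teacher = torch.sigmoid(teacher_logits / temperature)
    soft_student = torch.sigmoid(student_logits / temperature)
    kd_loss = F.binary_cross_entropy(soft_student, soft_teacher)
    return (1 - alpha) * label_loss + alpha * kd_loss

# The pairing is symmetric: freeze either the CNN or the AST as the teacher
# and train the other as the student, e.g.
#   with torch.no_grad():
#       teacher_logits = teacher(spectrogram)
#   loss = cmkd_loss(student(spectrogram), teacher_logits, targets)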
Related papers
- AFEN: Respiratory Disease Classification using Ensemble Learning [2.524195881002773]
We present AFEN (Audio Feature Ensemble Learning), a model that leverages Convolutional Neural Networks (CNNs) and XGBoost.
We use a meticulously selected mix of audio features which provide the salient attributes of the data and allow for accurate classification.
We empirically verify that AFEN sets a new state-of-the-art using Precision and Recall as metrics, while decreasing training time by 60%.
arXiv Detail & Related papers (2024-05-08T23:50:54Z)
- OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation [70.17681136234202]
We reexamine the design distinctions and test the limits of what a sparse CNN can achieve.
We propose two key components, i.e., adaptive receptive fields (spatially) and adaptive relation, to bridge the gap.
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a lightweight module.
arXiv Detail & Related papers (2024-03-21T14:06:38Z)
- Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation [12.177329445930276]
We propose a novel CNN-to-ViT KD framework, dubbed C2VKD.
We first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations.
We then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes.
arXiv Detail & Related papers (2023-10-11T07:45:37Z)
- Robust Mixture-of-Expert Training for Convolutional Neural Networks [141.3531209949845]
Sparsely-gated Mixture-of-Experts (MoE) models have demonstrated great promise in enabling high-accuracy and ultra-efficient model inference.
We propose a new router-expert alternating adversarial training framework for MoE, termed AdvMoE.
We find that AdvMoE achieves a 1% to 4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE.
arXiv Detail & Related papers (2023-08-19T20:58:21Z)
- Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation [6.617487928813374]
We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
We provide models at different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
arXiv Detail & Related papers (2022-11-09T09:58:22Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST. (A sketch of the patch-masking step appears after this list.)
arXiv Detail & Related papers (2021-10-19T07:58:28Z)
- AST: Audio Spectrogram Transformer [21.46018186487818]
We introduce the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. (A sketch of its patch-based embedding appears after this list.)
AST achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
arXiv Detail & Related papers (2021-04-05T05:26:29Z)
- A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights into the patterns learnt by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
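Two of the entries above lend themselves to short illustrations. First, for the AST entry: the model's defining step is ViT-style patch embedding of the spectrogram. Below is a minimal PyTorch sketch under the usual AST configuration (16x16 patches, stride 10 for overlap, 768-dimensional tokens); the module name and the input sizes are illustrative assumptions.

import torch
import torch.nn as nn

class SpectrogramPatchEmbed(nn.Module):
    def __init__(self, embed_dim=768, patch_size=16, stride=10):
        super().__init__()
        # A strided conv cuts out overlapping 16x16 patches and linearly
        # projects each one to embed_dim in a single operation.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=stride)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, n_frames), e.g. a log-mel spectrogram
        x = self.proj(spec)                  # (batch, embed_dim, h, w)
        return x.flatten(2).transpose(1, 2)  # (batch, h*w, embed_dim) tokens

tokens = SpectrogramPatchEmbed()(torch.randn(2, 1, 128, 1024))
# tokens.shape == (2, 1212, 768); in AST a [CLS] token and positional
# embeddings are added before a standard transformer encoder.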
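Second, for the SSAST entry: masked spectrogram patch modeling hides a random subset of the patch tokens and trains the model to reconstruct (generative) or identify (discriminative) them. Here is a minimal sketch of the masking step, reusing the tokens from the embedding above; the mask ratio and helper name are illustrative assumptions.

import torch

def mask_patches(tokens, mask_token, mask_ratio=0.4):
    # tokens: (batch, num_patches, dim); mask_token: (dim,), learnable in practice.
    b, n, d = tokens.shape
    num_masked = int(n * mask_ratio)
    # Choose a different random subset of patches for every clip in the batch.
    idx = torch.rand(b, n).argsort(dim=1)[:, :num_masked]
    masked = tokens.clone()
    masked[torch.arange(b).unsqueeze(1), idx] = mask_token
    return masked, idx  # pretraining losses are computed at the positions in idx

# Usage with the embedding sketch above:
#   masked_tokens, idx = mask_patches(tokens, mask_token=torch.zeros(768))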