Temporal Knowledge Distillation for On-device Audio Classification
- URL: http://arxiv.org/abs/2110.14131v1
- Date: Wed, 27 Oct 2021 02:29:54 GMT
- Title: Temporal Knowledge Distillation for On-device Audio Classification
- Authors: Kwanghee Choi, Martin Kersner, Jacob Morton, and Buru Chang
- Abstract summary: We propose a new knowledge distillation method designed to transfer the temporal knowledge embedded in the attention weights of large models into on-device models.
Our proposed method improves the predictive performance across diverse on-device architectures.
- Score: 2.2731658205414025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving the performance of on-device audio classification models remains a
challenge given the computational limits of the mobile environment. Many
studies leverage knowledge distillation to boost predictive performance by
transferring the knowledge from large models to on-device models. However, most
either fail to transfer the temporal information that is crucial to audio
classification tasks, or require the teacher and student to share a similar architecture. In this paper,
we propose a new knowledge distillation method designed to incorporate the
temporal knowledge embedded in attention weights of large models to on-device
models. Our distillation method is applicable to various types of
architectures, including the non-attention-based architectures such as CNNs or
RNNs, without any architectural change during inference. Through extensive
experiments on both an audio event detection dataset and a noisy keyword
spotting dataset, we show that our proposed method improves the predictive
performance across diverse on-device architectures.
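The abstract describes distilling the temporal attention weights of a large teacher into a student that need not use attention itself. A minimal sketch of one plausible formulation, assuming the student has an auxiliary per-frame head whose softmax is matched to the teacher's attention distribution via KL divergence (the function names, the 1e-12 smoothing, and the `alpha` mixing weight are illustrative assumptions, not the paper's exact loss):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_distillation_loss(student_frame_logits, teacher_attention,
                               task_loss, alpha=0.5):
    """Mix the student's task loss with a KL term that pulls the
    student's per-frame importance distribution toward the teacher's
    attention weights over the same T frames.

    student_frame_logits: length-T scores from an auxiliary student head
                          (removed at inference time).
    teacher_attention:    length-T attention weights from the teacher
                          (non-negative, summing to 1).
    """
    q = softmax(student_frame_logits)
    kl = sum(p * (math.log(p + 1e-12) - math.log(qi + 1e-12))
             for p, qi in zip(teacher_attention, q))
    return alpha * task_loss + (1.0 - alpha) * kl
```

When the student's distribution matches the teacher's attention exactly, the KL term vanishes and only the scaled task loss remains; the auxiliary head can then be dropped at inference, which is consistent with the abstract's claim of no architectural change during inference.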
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems [7.207019635697126]
This article aims to determine what information is encoded in an automatic speech recognition acoustic model (AM), and where it is located.
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection systems and speech sentiment/emotion identification.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
arXiv Detail & Related papers (2024-02-29T18:43:53Z) - Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition [62.85802939587308]
This paper focuses on exploring Class Incremental Audio-Visual Video Recognition (CIAVVR).
Since both stored data and learned model of past classes contain historical knowledge, the core challenge is how to capture past data knowledge and past model knowledge to prevent catastrophic forgetting.
We introduce Hierarchical Augmentation and Distillation (HAD), which comprises the Hierarchical Augmentation Module (HAM) and Hierarchical Distillation Module (HDM) to efficiently utilize the hierarchical structure of data and models.
arXiv Detail & Related papers (2024-01-11T23:00:24Z) - Topology combined machine learning for consonant recognition [8.188982461393278]
TopCap is capable of capturing features rarely detected in datasets with low intrinsic dimensionality.
In classifying voiced and voiceless consonants, TopCap achieves an accuracy exceeding 96%.
TopCap is geared towards designing topological convolutional layers for deep learning of speech and audio signals.
arXiv Detail & Related papers (2023-11-26T06:53:56Z) - S4Sleep: Elucidating the design space of deep-learning-based sleep stage classification models [1.068128849363198]
This study investigates the design choices within the broad category of encoder-predictor architectures.
We identify robust architectures applicable to both time series and spectrogram input representations.
These architectures incorporate structured state space models as integral components and achieve statistically significant performance improvements.
arXiv Detail & Related papers (2023-10-10T15:42:14Z) - Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z) - Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z) - Robust Audio Anomaly Detection [10.75127981612396]
The presented approach doesn't assume the presence of labeled anomalies in the training dataset.
The temporal dynamics are modeled using recurrent layers augmented with an attention mechanism.
The output of the network is an outlier robust probability density function.
arXiv Detail & Related papers (2022-02-03T17:19:42Z) - Leveraging the structure of dynamical systems for data-driven modeling [111.45324708884813]
We consider the impact of the training set and its structure on the quality of the long-term prediction.
We show how an informed design of the training set, based on invariants of the system and the structure of the underlying attractor, significantly improves the resulting models.
arXiv Detail & Related papers (2021-12-15T20:09:20Z) - Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead [88.17413955380262]
We introduce a novel architecture for early exiting based on the vision transformer architecture.
We show that our method works for both classification and regression problems.
We also introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis.
arXiv Detail & Related papers (2021-05-19T13:30:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.