HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound
Classification and Detection
- URL: http://arxiv.org/abs/2202.00874v1
- Date: Wed, 2 Feb 2022 04:49:14 GMT
- Title: HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound
Classification and Detection
- Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick,
Shlomo Dubnov
- Abstract summary: HTS-AT is an audio transformer with a hierarchical structure to reduce the model size and training time.
It achieves better performance in event localization than the previous CNN-based models.
- Score: 43.50970305209596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio classification is an important task of mapping audio samples to
their corresponding labels. Recently, transformer models with self-attention
mechanisms have been adopted in this field. However, existing audio
transformers require large GPU memory and long training times, while also
relying on pretrained vision models to achieve high performance, which limits
their scalability in audio tasks. To combat these problems, we introduce
HTS-AT: an audio transformer with a hierarchical structure that reduces the
model size and training time. It is further combined with a token-semantic
module that maps the final outputs into class featuremaps, enabling the model
to perform audio event detection (i.e., localization in time). We evaluate
HTS-AT on three audio classification datasets, where it achieves new
state-of-the-art (SOTA) results on AudioSet and ESC-50 and equals the SOTA on
Speech Command V2. It also achieves better event localization performance than
previous CNN-based models. Moreover, HTS-AT requires only 35% of the
parameters and 15% of the training time of the previous audio transformer.
These results demonstrate the high performance and high efficiency of HTS-AT.
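As a rough illustration of the token-semantic module described in the abstract, the sketch below runs a small convolutional head over the time-ordered tokens of a hierarchical audio encoder, producing a per-frame class featuremap for event localization plus a pooled clip-level prediction. The module name, kernel size, pooling, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenSemanticHead(nn.Module):
    """Illustrative token-semantic head (hypothetical, not the HTS-AT code).

    Maps the final token sequence of a hierarchical audio transformer to
    (a) a framewise class featuremap usable for event localization and
    (b) a clip-level prediction obtained by pooling over time.
    """

    def __init__(self, embed_dim: int = 768, num_classes: int = 527):
        super().__init__()
        # 1-D convolution over the token (time) axis turns embeddings into
        # per-frame class logits; kernel size 3 is an assumption.
        self.conv = nn.Conv1d(embed_dim, num_classes, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, embed_dim), tokens ordered along time
        x = tokens.transpose(1, 2)                  # (batch, embed_dim, num_tokens)
        framewise = self.conv(x)                    # (batch, num_classes, num_tokens)
        clipwise = torch.sigmoid(framewise.mean(dim=-1))  # clip-level probabilities
        return clipwise, torch.sigmoid(framewise)

# Toy usage: 32 coarse time tokens, 527 AudioSet classes.
head = TokenSemanticHead(embed_dim=768, num_classes=527)
clip_pred, event_map = head(torch.randn(2, 32, 768))
print(clip_pred.shape, event_map.shape)  # (2, 527) and (2, 527, 32)
```

The framewise output is what enables localization in time: thresholding event_map along its last axis yields per-class onset and offset estimates, while the pooled output serves as the clip-level classification result.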
Related papers
- Taming Data and Transformers for Audio Generation [49.54707963286065]
AutoCap is a high-quality and efficient automatic audio captioning model.
GenAu is a scalable transformer-based audio generation architecture.
We compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset.
arXiv Detail & Related papers (2024-06-27T17:58:54Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that the state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
arXiv Detail & Related papers (2023-09-14T22:35:39Z)
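The retrieval-augmented idea in the Re-AudioLDM entry above can be pictured as fetching the reference audio-text pairs whose captions are closest to the input prompt and handing their features to the generator as extra conditioning. The snippet below is a generic sketch of that retrieval step with assumed embedding sizes and variable names; it is not the Re-AudioLDM implementation.

```python
import torch

def retrieve_conditioning(prompt_emb, datastore_embs, datastore_feats, k=3):
    """Generic nearest-neighbour retrieval for a text-to-audio prompt (illustrative).

    prompt_emb:      (d,) embedding of the input caption
    datastore_embs:  (N, d) embeddings of reference captions
    datastore_feats: (N, f) paired features used as extra conditioning
    Returns the top-k reference features, to be appended to the generator's
    conditioning sequence (names and shapes here are assumptions).
    """
    sims = torch.nn.functional.cosine_similarity(
        datastore_embs, prompt_emb.unsqueeze(0), dim=-1)  # (N,) similarities
    topk = sims.topk(k).indices                           # indices of nearest pairs
    return datastore_feats[topk]                          # (k, f)

# Toy usage with random embeddings standing in for a real datastore.
extra_cond = retrieve_conditioning(torch.randn(512),
                                   torch.randn(1000, 512),
                                   torch.randn(1000, 64))
print(extra_cond.shape)  # torch.Size([3, 64])
```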
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation [6.617487928813374]
We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
arXiv Detail & Related papers (2022-11-09T09:58:22Z)
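For the transformer-to-CNN distillation entry above, the core training signal is a mix of the ground-truth labels and the teacher transformer's pre-computed (offline) predictions. The sketch below shows one common form of that objective for multi-label audio tagging; the mixing weight and the BCE-on-soft-targets formulation are assumptions rather than the cited paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, weight=0.5):
    """Offline knowledge-distillation step for multi-label tagging (illustrative).

    student_logits: (batch, num_classes) logits from the efficient CNN student
    teacher_logits: (batch, num_classes) logits pre-computed by the transformer teacher
    labels:         (batch, num_classes) binary ground-truth tags
    """
    soft_targets = torch.sigmoid(teacher_logits)  # teacher probabilities as soft labels
    hard_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft_loss = F.binary_cross_entropy_with_logits(student_logits, soft_targets)
    # Weighted mix of learning from labels and mimicking the teacher.
    return weight * hard_loss + (1.0 - weight) * soft_loss

# Toy usage with random tensors standing in for real model outputs.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 2, (4, 10)).float())
print(float(loss))
```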
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE)
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
arXiv Detail & Related papers (2021-10-19T07:58:28Z)
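The joint discriminative and generative masked spectrogram patch modeling (MSPM) named in the SSAST entry above combines a reconstruction term with a patch-matching term over the masked positions. The sketch below is a hedged illustration of that combination under assumed patch shapes; the actual SSAST losses differ in detail.

```python
import torch
import torch.nn.functional as F

def mspm_losses(predicted_patches, true_patches):
    """Joint generative + discriminative losses on masked spectrogram patches (illustrative).

    predicted_patches: (num_masked, patch_dim) model outputs at masked positions
    true_patches:      (num_masked, patch_dim) the original (hidden) patches
    The generative term reconstructs each masked patch (MSE); the
    discriminative term asks the model to pick the matching patch among all
    masked patches in the batch (InfoNCE-style classification).
    """
    gen_loss = F.mse_loss(predicted_patches, true_patches)

    # Similarity of every prediction to every true patch; the diagonal
    # entries correspond to the correct matches.
    logits = predicted_patches @ true_patches.t()      # (M, M) similarity matrix
    targets = torch.arange(predicted_patches.size(0))  # index of the true patch
    disc_loss = F.cross_entropy(logits, targets)

    return gen_loss + disc_loss

# Toy usage: 16 masked patches of 256 dimensions each.
loss = mspm_losses(torch.randn(16, 256), torch.randn(16, 256))
print(float(loss))
```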
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.