Asca: less audio data is more insightful
- URL: http://arxiv.org/abs/2309.13373v1
- Date: Sat, 23 Sep 2023 13:24:06 GMT
- Title: Asca: less audio data is more insightful
- Authors: Xiang Li, Junhao Chen, Chao Li, Hongwu Lv
- Abstract summary: We introduce the Audio Spectrogram Convolution Attention (ASCA) based on CoAtNet.
On the BirdCLEF2023 and AudioSet (Balanced) benchmarks, ASCA achieved accuracies of 81.2% and 35.1%, respectively.
The unique structure of our model enriches its output representations, enabling generalization across various audio detection tasks.
- Score: 10.354385253247761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio recognition in specialized areas such as birdsong and submarine
acoustics faces challenges in large-scale pre-training due to the limitations
in available samples imposed by sampling environments and specificity
requirements. While the Transformer model excels in audio recognition, its
dependence on vast amounts of data becomes restrictive in resource-limited
settings. Addressing this, we introduce the Audio Spectrogram Convolution
Attention (ASCA) based on CoAtNet, integrating a Transformer-convolution hybrid
architecture, novel network design, and attention techniques, further augmented
with data enhancement and regularization strategies. On the BirdCLEF2023 and
AudioSet (Balanced) benchmarks, ASCA achieved accuracies of 81.2% and 35.1%, respectively,
significantly outperforming competing methods. The unique structure of our
model enriches its output representations, enabling generalization across various audio detection
tasks. Our code can be found at https://github.com/LeeCiang/ASCA.
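The abstract names the ingredients without showing them. As a purely illustrative sketch (not the authors' implementation, which lives in the repository linked above), the following PyTorch block shows the CoAtNet-style pattern of a depthwise-convolution stage followed by self-attention over the time-frequency positions of a spectrogram feature map; all layer sizes and names are assumptions:

```python
# Hypothetical CoAtNet-style convolution + attention hybrid block for
# spectrogram features; illustrative only, not the ASCA implementation.
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        # Convolutional stage: depthwise-separable conv, as in MBConv
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Attention stage: self-attention over flattened time-frequency positions
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, C, freq, time)
        x = x + self.conv(x)                   # local features, residual add
        b, c, f, t = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (batch, freq*time, C)
        q = self.norm(seq)
        seq = seq + self.attn(q, q, q, need_weights=False)[0]
        return seq.transpose(1, 2).reshape(b, c, f, t)

# e.g. a (1, 64, 128, 256) mel-spectrogram feature map keeps its shape
y = ConvAttentionBlock()(torch.randn(1, 64, 128, 256))
```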
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a framework and benchmark for detecting synthetic, AI-generated audio.
It aims to provide a comprehensive evaluation of methods for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To make the synthetic audio acoustically consistent with the small-scale dataset, we align the generations of the T2A model with the dataset using preference optimization.
To ensure compositional diversity in the generated data, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
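As a loose illustration of what preference optimization over a T2A model's generations can look like, here is a generic DPO-style loss in PyTorch; Synthio's actual objective and inputs are not specified in this summary, so every name below is an assumption:

```python
# Generic DPO-style preference loss; only shows the shape of preference
# optimization over preferred vs. dispreferred generations.
import torch
import torch.nn.functional as F

def preference_loss(logp_preferred, logp_rejected,
                    ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """logp_* are model log-likelihoods of audio generations; ref_logp_*
    come from a frozen reference model. Preferred = closer to the
    small-scale target dataset, rejected = off-distribution."""
    margin = (logp_preferred - ref_logp_preferred) \
           - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# toy usage with one scalar log-prob per generation, batch of 4
loss = preference_loss(torch.randn(4), torch.randn(4),
                       torch.randn(4), torch.randn(4))
```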
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens each modality's representation by fusing audio and visual features at multiple levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new state-of-the-art cpCER of 29.13% on the evaluation dataset.
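A minimal sketch of cross-attention fusion between two modality streams, assuming standard multi-head attention; the paper's actual module, and the encoder levels at which it is applied, may differ:

```python
# Hedged sketch: each modality queries the other and keeps a residual
# of itself. MLCA-AVSR applies fusion at several encoder layers.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):   # (batch, T_a, dim), (batch, T_v, dim)
        audio = audio + self.a2v(audio, video, video, need_weights=False)[0]
        video = video + self.v2a(video, audio, audio, need_weights=False)[0]
        return audio, video

fused_a, fused_v = CrossModalFusion()(torch.randn(2, 100, 256),
                                      torch.randn(2, 25, 256))
```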
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
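Purely as a hedged illustration of ratio-dependent weight updates (not the paper's actual algorithm, which this summary does not specify), the following sketch scales a pull toward the pre-fine-tuning weights by the share of genuine utterances in a batch:

```python
# Loose sketch of ratio-weighted regularization during fine-tuning.
# Assumption for illustration: more genuine data in a batch -> pull
# harder toward the old weights, which encode prior genuine-speech
# knowledge. The paper's direction computation is more involved.
import torch

def regularized_step(param, grad, old_param, n_genuine, n_fake,
                     lr=1e-3, lam=0.5):
    ratio = n_genuine / max(n_genuine + n_fake, 1)
    direction = grad + lam * ratio * (param - old_param)
    return param - lr * direction

w = torch.randn(10)
w_new = regularized_step(w, torch.randn(10), w.clone(),
                         n_genuine=6, n_fake=2)
```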
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, each built from iterative A-FRCNN blocks that share weights within a modality.
Experiments demonstrate the superiority of our model in both settings over various audio-only and audio-visual baselines.
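The key efficiency idea, iterating a single weight-shared block per modality, can be sketched as follows; the real A-FRCNN block is a recurrent fusion CNN, so this MLP stand-in is only illustrative:

```python
# Minimal sketch of weight sharing across iterations: one block per
# modality applied repeatedly, so depth comes without extra parameters.
import torch
import torch.nn as nn

class IterativeBranch(nn.Module):
    def __init__(self, dim=128, iterations=4):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))
        self.iterations = iterations

    def forward(self, x):
        for _ in range(self.iterations):   # same weights on every pass
            x = x + self.block(x)
        return x

audio_branch = IterativeBranch()            # one branch per modality
video_branch = IterativeBranch()
refined = audio_branch(torch.randn(2, 100, 128))
```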
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
- LEAN: Light and Efficient Audio Classification Network [1.5070398746522742]
We propose a lightweight on-device deep learning-based model for audio classification, LEAN.
LEAN consists of a raw waveform-based temporal feature extractor, a trainable wave encoder, and a log-mel-based pretrained YAMNet.
We show that combining the trainable wave encoder and pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
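A hypothetical sketch of the overall wiring, with a trainable waveform encoder and frozen pretrained embeddings fused by cross attention standing in for the temporal realignment; all dimensions and the `yamnet_embed` input are assumptions, not LEAN's configuration:

```python
# Sketch: trainable waveform encoder + frozen pretrained embeddings,
# fused by cross attention. "yamnet_embed" stands in for real YAMNet
# features computed elsewhere.
import torch
import torch.nn as nn

class LightClassifier(nn.Module):
    def __init__(self, dim=128, classes=10):
        super().__init__()
        self.wave_encoder = nn.Conv1d(1, dim, kernel_size=400, stride=160)
        self.realign = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Linear(dim, classes)

    def forward(self, wav, yamnet_embed):    # wav: (B, 1, N); embed: (B, T, dim)
        feats = self.wave_encoder(wav).transpose(1, 2)   # (B, T', dim)
        # pretrained embeddings query the trainable features
        fused = self.realign(yamnet_embed, feats, feats,
                             need_weights=False)[0]
        return self.head(fused.mean(dim=1))

logits = LightClassifier()(torch.randn(2, 1, 16000), torch.randn(2, 48, 128))
```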
arXiv Detail & Related papers (2023-05-22T04:45:04Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
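To make the "single multiscale spectrogram adversary" concrete, here is a toy discriminator that scores magnitude spectrograms computed at several STFT resolutions with one shared network; the actual adversary in the paper is considerably more elaborate:

```python
# Toy multiscale spectrogram discriminator: one shared conv net scores
# magnitude spectrograms at several STFT resolutions.
import torch
import torch.nn as nn

class MultiScaleSpecDiscriminator(nn.Module):
    def __init__(self, n_ffts=(256, 512, 1024)):
        super().__init__()
        self.n_ffts = n_ffts
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1),
                                 nn.LeakyReLU(0.2),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, wav):                      # wav: (batch, samples)
        scores = []
        for n_fft in self.n_ffts:
            spec = torch.stft(wav, n_fft, hop_length=n_fft // 4,
                              window=torch.hann_window(n_fft),
                              return_complex=True).abs()   # (B, F, T)
            scores.append(self.net(spec.unsqueeze(1)).mean())
        return torch.stack(scores)              # one score per scale

scores = MultiScaleSpecDiscriminator()(torch.randn(2, 16000))
```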
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low-complexity system for Acoustic Scene Classification (ASC).
We first construct an ASC baseline system in which a novel inception-residual network architecture is proposed to deal with mismatched recording devices.
To further improve performance while keeping model complexity low, we apply two techniques: an ensemble of multiple spectrograms and channel reduction.
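A generic inception-residual block, assuming four parallel kernel sizes whose outputs are concatenated and added back residually; the paper's exact branch design is not reproduced here:

```python
# Generic inception-residual block: parallel receptive fields,
# concatenated, then a residual add. Illustrative layout only.
import torch
import torch.nn as nn

class InceptionResidual(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        branch = channels // 4
        self.b1 = nn.Conv2d(channels, branch, 1)
        self.b3 = nn.Conv2d(channels, branch, 3, padding=1)
        self.b5 = nn.Conv2d(channels, branch, 5, padding=2)
        self.b7 = nn.Conv2d(channels, branch, 7, padding=3)
        self.act = nn.ReLU()

    def forward(self, x):
        out = torch.cat([self.b1(x), self.b3(x),
                         self.b5(x), self.b7(x)], dim=1)
        return self.act(x + out)

y = InceptionResidual()(torch.randn(1, 64, 128, 128))
```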
arXiv Detail & Related papers (2022-03-23T10:27:41Z) - Automatic Audio Captioning using Attention weighted Event based
Embeddings [25.258177951665594]
We propose an encoder-decoder architecture with lightweight (i.e., fewer learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques is able to surpass the existing literature.
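A rough sketch of attention-weighted event embeddings feeding lightweight Bi-LSTM layers; vocabulary size, dimensions, and the attention form are placeholders rather than the paper's configuration:

```python
# Sketch: temporal attention weights emphasize salient event embeddings
# before a small Bi-LSTM produces per-step vocabulary logits.
import torch
import torch.nn as nn

class AttentionCaptioner(nn.Module):
    def __init__(self, embed_dim=128, hidden=64, vocab=1000):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)          # temporal attention
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, aed_embeds):                    # (batch, T, embed_dim)
        w = torch.softmax(self.score(aed_embeds), dim=1)
        weighted = w * aed_embeds                     # weight salient events
        h, _ = self.lstm(weighted)
        return self.out(h)                            # (batch, T, vocab)

logits = AttentionCaptioner()(torch.randn(2, 50, 128))
```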
arXiv Detail & Related papers (2022-01-28T05:54:19Z) - Timbre Transfer with Variational Auto Encoding and Cycle-Consistent
Adversarial Networks [0.6445605125467573]
This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality.
The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio.
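Schematically, the combined objective adds an adversarial term to the usual VAE reconstruction and KL terms; the cycle-consistency losses implied by the title are omitted from this sketch, and all weights are arbitrary:

```python
# Schematic VAE + GAN objective: reconstruction + KL + generator-side
# adversarial term. Loss weights are arbitrary placeholders.
import torch
import torch.nn.functional as F

def vae_gan_loss(recon, target, mu, logvar, disc_score_fake,
                 adv_weight=0.1):
    recon_loss = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # generator side of the GAN objective: fool the discriminator
    adv = F.binary_cross_entropy_with_logits(
        disc_score_fake, torch.ones_like(disc_score_fake))
    return recon_loss + kl + adv_weight * adv

loss = vae_gan_loss(torch.randn(4, 80), torch.randn(4, 80),
                    torch.zeros(4, 16), torch.zeros(4, 16),
                    torch.randn(4, 1))
```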
arXiv Detail & Related papers (2021-09-05T15:06:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.