Device-Robust Acoustic Scene Classification via Impulse Response Augmentation
- URL: http://arxiv.org/abs/2305.07499v2
- Date: Tue, 27 Jun 2023 08:43:13 GMT
- Title: Device-Robust Acoustic Scene Classification via Impulse Response Augmentation
- Authors: Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer
- Abstract summary: We study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers.
Results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle.
We also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
- Score: 5.887969742827488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to generalize to a wide range of recording devices is a crucial
performance factor for audio classification models. The characteristics of
different types of microphones introduce distributional shifts in the digitized
audio signals due to their varying frequency responses. If this domain shift is
not taken into account during training, the model's performance could degrade
severely when it is applied to signals recorded by unseen devices. In
particular, training a model on audio signals recorded with a small number of
different microphones can make generalization to unseen devices difficult. To
tackle this problem, we convolve audio signals in the training set with
pre-recorded device impulse responses (DIRs) to artificially increase the
diversity of recording devices. We systematically study the effect of DIR
augmentation on the task of Acoustic Scene Classification using CNNs and Audio
Spectrogram Transformers. The results show that DIR augmentation in isolation
performs similarly to the state-of-the-art method Freq-MixStyle. However, we
also show that DIR augmentation and Freq-MixStyle are complementary, achieving
a new state-of-the-art performance on signals recorded by devices unseen during
training.
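To make the paper's recipe concrete, here is a minimal PyTorch sketch of the two augmentations the abstract describes: convolving training waveforms with device impulse responses (DIRs), and Freq-MixStyle on spectrograms. This illustrates the general idea and is not the authors' released code; the application probabilities, the Beta mixing coefficient, and the function names are assumptions.

```python
import random

import torch
import torch.nn.functional as F


def dir_augment(waveform: torch.Tensor, dirs: list,
                p: float = 0.6) -> torch.Tensor:
    """With probability p, convolve a mono waveform [T] with a randomly
    chosen device impulse response [K] to simulate another microphone."""
    if random.random() > p:
        return waveform
    dir_ir = random.choice(dirs)
    # conv1d computes cross-correlation, so flip the kernel to convolve.
    kernel = dir_ir.flip(0).view(1, 1, -1)
    x = waveform.view(1, 1, -1)
    y = F.conv1d(x, kernel, padding=kernel.shape[-1] - 1)
    y = y.view(-1)[: waveform.shape[-1]]
    # Roughly match the loudness of the original signal.
    return y * (waveform.abs().max() / (y.abs().max() + 1e-9))


def freq_mixstyle(spec: torch.Tensor, alpha: float = 0.3,
                  p: float = 0.4, eps: float = 1e-5) -> torch.Tensor:
    """Freq-MixStyle on a batch of spectrograms [B, C, F, T]: normalize
    per frequency bin, then re-scale with statistics mixed between
    random pairs of batch items."""
    if random.random() > p:
        return spec
    b = spec.size(0)
    # Per-frequency-bin statistics over channels and time.
    mu = spec.mean(dim=(1, 3), keepdim=True)                 # [B, 1, F, 1]
    sig = (spec.var(dim=(1, 3), keepdim=True) + eps).sqrt()  # [B, 1, F, 1]
    normed = (spec - mu) / sig
    # Mix each item's statistics with those of a random partner.
    perm = torch.randperm(b)
    lam = torch.distributions.Beta(alpha, alpha).sample((b, 1, 1, 1))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return normed * sig_mix + mu_mix
```

Note that the two methods act at different stages: DIR augmentation is applied to the raw waveform before the spectrogram is computed, while Freq-MixStyle operates on batches of spectrograms, which is consistent with the abstract's finding that they are complementary.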
Related papers
- Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation [0.0]
We introduce a unified generative framework to enhance the resilience of sound event classification systems against device variability.
Our method outperforms the state-of-the-art method by 2.6% and reduces variability by 0.8% in macro-average F1 score; a generic sketch of the FiLM conditioning mechanism it builds on follows this list.
arXiv Detail & Related papers (2024-10-23T23:10:09Z)
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
A common solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
However, the separator typically produces artefacts that degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z)
- Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture [11.063156506583562]
We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy.
We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs.
Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
arXiv Detail & Related papers (2024-06-05T13:50:59Z)
- Microphone Conversion: Mitigating Device Variability in Sound Event Classification [0.0]
We introduce a new augmentation technique to enhance the resilience of sound event classification (SEC) systems against device variability through the use of CycleGAN.
Our method addresses limited device diversity in training data by enabling unpaired training to transform input spectrograms as if they were recorded on a different device.
arXiv Detail & Related papers (2024-01-12T21:59:01Z)
- Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation [21.896817015593122]
MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio.
We propose a multi-level data augmentation pipeline that augments different levels of audio features.
We find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error.
arXiv Detail & Related papers (2023-09-27T18:23:03Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that classifies six different hand gestures from a limited number of samples and generalizes well to a wider audience.
We rely on more elementary methods, such as applying random bounds to a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z)
- Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification [54.57150493905063]
Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal is recorded.
We propose a robust feature learning (RFL) framework to train the CNN.
arXiv Detail & Related papers (2021-08-11T03:33:05Z)
- Discriminative Singular Spectrum Classifier with Applications on Bioacoustic Signal Recognition [67.4171845020675]
We present a bioacoustic signal classifier equipped with a discriminative mechanism to efficiently extract useful features for analysis and classification.
Unlike current bioacoustic recognition methods, which are task-oriented, the proposed model relies on transforming the input signals into vector subspaces.
The validity of the proposed method is verified using three challenging bioacoustic datasets containing anuran, bee, and mosquito species.
arXiv Detail & Related papers (2021-03-18T11:01:21Z)
- Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in noisy conditions.
arXiv Detail & Related papers (2021-01-29T18:44:05Z)
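The "Unified Microphone Conversion" entry above names Feature-wise Linear Modulation (FiLM) as its conditioning mechanism. Below is a minimal, generic FiLM layer in PyTorch; conditioning on a device-ID embedding is an assumption for illustration, not that paper's exact architecture.

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Scale and shift feature maps channel-wise, conditioned on a vector."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts a (gamma, beta) pair per channel.
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, C, F, T] feature maps; cond: [B, cond_dim] condition vector.
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma[:, :, None, None] * x + beta[:, :, None, None]


# Usage sketch: modulate CNN features with an embedding of a device ID
# (the embedding-based conditioning is an assumption for illustration).
film = FiLM(cond_dim=16, num_channels=32)
device_emb = nn.Embedding(num_embeddings=8, embedding_dim=16)
feats = torch.randn(4, 32, 64, 100)            # [B, C, F, T]
cond = device_emb(torch.tensor([0, 1, 2, 3]))  # one device ID per item
out = film(feats, cond)                        # same shape as feats
```

A design note: because only the condition vector changes, a single FiLM-conditioned network can in principle be steered toward many target devices, which matches the many-to-many device mapping that entry describes.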
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.