Microphone Conversion: Mitigating Device Variability in Sound Event
Classification
- URL: http://arxiv.org/abs/2401.06913v1
- Date: Fri, 12 Jan 2024 21:59:01 GMT
- Title: Microphone Conversion: Mitigating Device Variability in Sound Event
Classification
- Authors: Myeonghoon Ryu, Hongseok Oh, Suji Lee and Han Park
- Abstract summary: We introduce a new augmentation technique to enhance the resilience of sound event classification (SEC) systems against device variability through the use of CycleGAN.
Our method addresses limited device diversity in training data by enabling unpaired training to transform input spectrograms as if they were recorded on a different device.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we introduce a new augmentation technique to enhance the
resilience of sound event classification (SEC) systems against device
variability through the use of CycleGAN. We also present a unique dataset to
evaluate this method. As SEC systems become increasingly common, it is crucial
that they work well with audio from diverse recording devices. Our method
addresses limited device diversity in training data by enabling unpaired
training to transform input spectrograms as if they were recorded on a
different device. Our experiments show that our approach outperforms existing
methods in generalization by 5.2%-11.5% in weighted F1 score. Additionally, it
surpasses current methods in adaptability across diverse recording devices,
achieving a 6.5%-12.8% improvement in weighted F1 score.
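
As a rough illustration of the idea, the sketch below trains two generators
with a cycle-consistency loss so that unpaired spectrograms can be mapped
between two device styles. Shapes, channel counts, and the cycle weight are
illustrative assumptions; this is not the authors' released implementation,
and the discriminator update is omitted for brevity.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a log-mel spectrogram from one device's style to another's."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)  # residual: learn only the device "coloration"

class Discriminator(nn.Module):
    """PatchGAN-style critic over spectrogram patches."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(2 * ch, 1, 4, padding=1),
        )

    def forward(self, x):
        return self.net(x)

G_ab, G_ba = Generator(), Generator()      # device A -> B and B -> A
D_a, D_b = Discriminator(), Discriminator()
opt_g = torch.optim.Adam(
    list(G_ab.parameters()) + list(G_ba.parameters()), lr=2e-4)
gan_loss, cyc_loss = nn.MSELoss(), nn.L1Loss()  # LSGAN adversarial + L1 cycle

# One generator step; real_a and real_b are unpaired (not time-aligned).
real_a = torch.randn(4, 1, 128, 256)       # (batch, 1, mel bins, frames)
real_b = torch.randn(4, 1, 128, 256)
fake_b, fake_a = G_ab(real_a), G_ba(real_b)
pred_b, pred_a = D_b(fake_b), D_a(fake_a)
adv = (gan_loss(pred_b, torch.ones_like(pred_b))
       + gan_loss(pred_a, torch.ones_like(pred_a)))
cyc = cyc_loss(G_ba(fake_b), real_a) + cyc_loss(G_ab(fake_a), real_b)
loss_g = adv + 10.0 * cyc                  # cycle weight 10 is a common default
opt_g.zero_grad()
loss_g.backward()
opt_g.step()                               # discriminator step omitted for brevity
```

At inference-free augmentation time, a trained generator would be applied to
training spectrograms to simulate recordings from a device absent from the
training set.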
Related papers
- ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning [57.67273340380651]
Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks. These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
arXiv Detail & Related papers (2025-07-03T14:29:43Z)
- Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation [0.0]
We introduce a unified generative framework to enhance the resilience of sound event classification systems against device variability.
Our method outperforms the state-of-the-art method by 2.6% and reduces variability by 0.8% in macro-average F1 score.
arXiv Detail & Related papers (2024-10-23T23:10:09Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens each modality's representation by fusing the modalities at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
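
The multi-level fusion idea above can be sketched in a few lines: at each
encoder stage, let audio features attend over visual features and fuse them
with a residual connection. The dimensions, number of stages, and module name
are illustrative assumptions, not the MLCA-AVSR architecture itself.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One fusion stage: audio features attend over visual features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        ctx, _ = self.attn(audio, visual, visual)  # audio queries, visual keys/values
        return self.norm(audio + ctx)              # residual fusion

# Fusing at several encoder levels: apply one stage per layer's outputs.
stages = nn.ModuleList(CrossModalFusion() for _ in range(3))
audio = torch.randn(2, 100, 256)                   # (batch, audio frames, dim)
visual = torch.randn(2, 25, 256)                   # (batch, video frames, dim)
for stage in stages:
    audio = stage(audio, visual)                   # hypothetical per-level fusion
```

- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]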
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- Device-Robust Acoustic Scene Classification via Impulse Response Augmentation [5.887969742827488]
We study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers.
Results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle.
We also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
arXiv Detail & Related papers (2023-05-12T14:12:56Z)
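
The DIR idea above is straightforward to sketch: convolve each training clip
with a measured device impulse response so the classifier hears the same scene
as if captured by other microphones. The toy arrays below stand in for real
recordings and measured IRs; names and the level-matching step are
illustrative, not the paper's pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def dir_augment(wave: np.ndarray, dirs: list, rng=None) -> np.ndarray:
    """Simulate an unseen recording device by convolving with one of its IRs."""
    rng = rng if rng is not None else np.random.default_rng()
    ir = dirs[rng.integers(len(dirs))]
    out = fftconvolve(wave, ir, mode="full")[: len(wave)]  # keep original length
    peak = np.max(np.abs(out))
    # Rescale so the augmented clip roughly matches the original level.
    return out / peak * np.max(np.abs(wave)) if peak > 0 else out

# Usage with toy data standing in for real recordings and measured IRs:
wave = np.random.randn(16000).astype(np.float32)           # 1 s at 16 kHz
dirs = [np.random.randn(512).astype(np.float32) for _ in range(4)]
augmented = dir_augment(wave, dirs)
```

- Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]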
We propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization.
Our system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network (F-measure 34.02) by a large margin.
arXiv Detail & Related papers (2022-07-15T22:41:30Z)
- SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy [69.24683717901262]
Deep learning based singing voice synthesis (SVS) systems have been shown to flexibly generate singing voices of improved quality.
In this work, we explore different data augmentation methods to boost the training of SVS systems.
To further stabilize the training, we introduce the cycle-consistent training strategy.
arXiv Detail & Related papers (2022-03-31T12:50:10Z)
- Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN [41.88097793717185]
We propose a novel method named Multi-discriminators CycleGAN to reduce the noise in input speech.
We show that training multiple generators on homogeneous subsets of the training data is better than training one generator on all the training data.
arXiv Detail & Related papers (2021-12-12T19:56:34Z)
- Noise-resistant Deep Metric Learning with Ranking-based Instance Selection [59.286567680389766]
We propose a noise-resistant training technique for DML, which we name Probabilistic Ranking-based Instance Selection with Memory (PRISM).
PRISM identifies noisy data in a minibatch using average similarity against image features extracted from several previous versions of the neural network.
To alleviate the high computational cost brought by the memory bank, we introduce an acceleration method that replaces individual data points with the class centers.
arXiv Detail & Related papers (2021-03-30T03:22:17Z)
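
A hedged sketch of the ranking idea above: score each minibatch sample by its
similarity to its class center (in PRISM, centers are built from a feature
memory bank spanning earlier versions of the network) and drop the
lowest-ranked fraction as likely label noise. The noise rate and all names
below are illustrative, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def select_clean(features, labels, centers, noise_rate=0.2):
    """Return a boolean mask over the minibatch marking samples kept as clean.

    features: (B, D) embeddings from the current network
    labels:   (B,) integer class labels
    centers:  (C, D) class centers, e.g. from a feature memory bank
    """
    sims = F.cosine_similarity(features, centers[labels], dim=1)
    k = int(len(sims) * noise_rate)          # number of samples to drop
    if k == 0:
        return torch.ones_like(sims, dtype=torch.bool)
    threshold = sims.kthvalue(k).values      # k-th smallest similarity
    return sims > threshold                  # keep everything ranked above it

feats = torch.randn(8, 64)
labels = torch.randint(0, 5, (8,))
centers = torch.randn(5, 64)                 # stand-in for memory-bank centers
clean_mask = select_clean(feats, labels, centers)
```

- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]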
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that aligns the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to those of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
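
The band-wise matching above amounts to whitening each frequency band of a
target-domain spectrogram and re-scaling it with the source domain's per-band
mean and standard deviation. A minimal sketch under those assumptions, with
toy data and illustrative names:

```python
import numpy as np

def match_band_stats(target_spec, src_mean, src_std, eps=1e-8):
    """target_spec: (bands, frames) log spectrogram from the target device.
    src_mean, src_std: per-band statistics precomputed on source training data.
    """
    t_mean = target_spec.mean(axis=1, keepdims=True)
    t_std = target_spec.std(axis=1, keepdims=True)
    z = (target_spec - t_mean) / (t_std + eps)        # whiten each band
    return z * src_std[:, None] + src_mean[:, None]   # re-color to source stats

source = np.random.randn(64, 500) * 2.0 + 1.0   # toy source-domain spectrogram
src_mean, src_std = source.mean(axis=1), source.std(axis=1)
target = np.random.randn(64, 400)                # toy target-domain spectrogram
aligned = match_band_stats(target, src_mean, src_std)
```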