Domain Generalization with Relaxed Instance Frequency-wise Normalization
for Multi-device Acoustic Scene Classification
- URL: http://arxiv.org/abs/2206.12513v1
- Date: Fri, 24 Jun 2022 23:45:50 GMT
- Title: Domain Generalization with Relaxed Instance Frequency-wise Normalization
for Multi-device Acoustic Scene Classification
- Authors: Byeonggeun Kim, Seunghan Yang, Jangho Kim, Hyunsin Park, Juntae Lee,
Simyung Chang
- Abstract summary: Domain-relevant information in an audio feature is dominant in frequency statistics rather than channel statistics.
We introduce Relaxed Instance Frequency-wise Normalization (RFN): a plug-and-play, explicit normalization module along the frequency axis.
RFN can eliminate instance-specific domain discrepancy in an audio feature while relaxing undesirable loss of useful discriminative information.
- Score: 18.186932959605247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In image processing with two-dimensional convolutional neural
networks (2D-CNNs), domain information can be manipulated through channel
statistics, and instance normalization has been a promising way to obtain
domain-invariant features. However, unlike in image processing, our analysis
shows that domain-relevant information in an audio feature is dominant in
frequency statistics rather than channel statistics. Motivated by our analysis, we
introduce Relaxed Instance Frequency-wise Normalization (RFN): a plug-and-play,
explicit normalization module along the frequency axis which can eliminate
instance-specific domain discrepancy in an audio feature while relaxing
undesirable loss of useful discriminative information. Empirically, simply
adding RFN to networks outperforms previous domain generalization approaches
on acoustic scene classification by clear margins and yields improved
robustness across multiple audio devices. Notably, the proposed RFN won the
DCASE2021 challenge TASK1A, low-complexity acoustic scene classification with
multiple devices, with a clear margin. This paper extends our technical
report.
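The abstract describes RFN only in words, so the following is a minimal sketch rather than the authors' implementation. It assumes that "relaxed" means blending instance frequency-wise normalization (statistics taken per instance and per frequency bin, over channels and time) with layer normalization, which keeps more of the original feature. The function names, the weight `lam`, its default of 0.5, and the (batch, channels, frequency, time) layout are all assumptions made for illustration.

```python
import numpy as np


def instance_frequency_norm(x, eps=1e-5):
    """Normalize each (instance, frequency bin) over channels and time.

    x is assumed to have shape (batch, channels, frequency, time).
    """
    mean = x.mean(axis=(1, 3), keepdims=True)
    var = x.var(axis=(1, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def layer_norm(x, eps=1e-5):
    """Normalize each instance over all of its channels, bins, and frames."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def relaxed_frequency_norm(x, lam=0.5, eps=1e-5):
    """Blend the two normalizations (assumed form of the relaxation).

    lam = 0 gives pure frequency-wise normalization; lam = 1 gives pure
    layer normalization.
    """
    return lam * layer_norm(x, eps) + (1.0 - lam) * instance_frequency_norm(x, eps)
```

Under these assumptions, a device that adds a fixed offset to each frequency bin leaves the output of `instance_frequency_norm` unchanged, which is one way to read "eliminating instance-specific domain discrepancy"; the `lam` term then reintroduces part of the per-frequency structure that pure normalization would discard.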
Related papers
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- FAN-Net: Fourier-Based Adaptive Normalization For Cross-Domain Stroke Lesion Segmentation [17.150527504559594]
We propose a novel FAN-Net, a U-Net-based segmentation network incorporated with a Fourier-based adaptive normalization (FAN)
The experimental results on the ATLAS dataset, which consists of MR images from 9 sites, show the superior performance of the proposed FAN-Net compared with baseline methods.
arXiv Detail & Related papers (2023-04-23T06:58:21Z)
- Cross-domain Voice Activity Detection with Self-Supervised Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments in an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
arXiv Detail & Related papers (2022-09-22T14:53:44Z)
- Few-shot One-class Domain Adaptation Based on Frequency for Iris Presentation Attack Detection [33.41823375502942]
Iris presentation attack detection (PAD) has achieved remarkable success in ensuring the reliability and security of iris recognition systems.
Most existing methods exploit discriminative features in the spatial domain and report outstanding performance under intra-dataset settings.
We propose a new domain adaptation setting called Few-shot One-class Domain Adaptation (FODA), where adaptation only relies on a limited number of target bonafide samples.
arXiv Detail & Related papers (2022-04-01T11:55:06Z)
- Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose a two-branch detection framework that adaptively learns frequency information, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
- TBNet: Two-Stream Boundary-aware Network for Generic Image Manipulation Localization [49.521622399483846]
We propose a novel end-to-end two-stream boundary-aware network (abbreviated as TBNet) for generic image manipulation localization.
The proposed TBNet can significantly outperform state-of-the-art generic image manipulation localization methods in terms of both MCC and F1.
arXiv Detail & Related papers (2021-08-10T08:22:05Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Robust Multi-channel Speech Recognition using Frequency Aligned Network [23.397670239950187]
We use a frequency aligned network for robust automatic speech recognition.
We show that our multi-channel acoustic model with a frequency aligned network achieves up to an 18% relative reduction in word error rate.
arXiv Detail & Related papers (2020-02-06T21:47:39Z)