Focal Modulation Networks for Interpretable Sound Classification
- URL: http://arxiv.org/abs/2402.02754v1
- Date: Mon, 5 Feb 2024 06:20:52 GMT
- Title: Focal Modulation Networks for Interpretable Sound Classification
- Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
- Abstract summary: This paper addresses the problem of interpretability by design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets).
We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset.
Our method outperforms a similarly sized vision transformer in terms of both accuracy and interpretability.
- Score: 14.360545133618267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing success of deep neural networks has raised concerns about
their inherent black-box nature, posing challenges related to interpretability
and trust. While there has been extensive exploration of interpretation
techniques in vision and language, interpretability in the audio domain has
received limited attention, primarily focusing on post-hoc explanations. This
paper addresses the problem of interpretability by design in the audio domain
by utilizing the recently proposed attention-free focal modulation networks
(FocalNets). We apply FocalNets to the task of environmental sound
classification for the first time and evaluate their interpretability
properties on the popular ESC-50 dataset. Our method outperforms a similarly
sized vision transformer in terms of both accuracy and interpretability.
Furthermore, it is competitive against PIQ, a method specifically designed for
post-hoc interpretation in the audio domain.
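For intuition about the building block involved, below is a minimal PyTorch sketch of a focal modulation layer applied to spectrogram-like inputs, following the publicly described FocalNet formulation. The number of focal levels, kernel schedule, and layer sizes are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalModulation(nn.Module):
    """Minimal focal modulation block (after Yang et al., 2022) for
    spectrogram-like inputs of shape (batch, channels, freq, time).
    Kernel sizes and focal levels are illustrative, not the paper's config."""

    def __init__(self, dim: int, focal_levels: int = 3, focal_window: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection yields the query, the context, and per-level gates.
        self.proj_in = nn.Conv2d(dim, 2 * dim + focal_levels + 1, kernel_size=1)
        # Hierarchical contextualization: depthwise convs with growing kernels.
        self.context_layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=focal_window + 2 * l,
                          padding=(focal_window + 2 * l) // 2,
                          groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        )
        self.h = nn.Conv2d(dim, dim, kernel_size=1)       # modulator projection
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dim = x.shape[1]
        q, ctx, gates = torch.split(
            self.proj_in(x), [dim, dim, self.focal_levels + 1], dim=1)
        ctx_all = torch.zeros_like(ctx)
        for l, layer in enumerate(self.context_layers):
            ctx = layer(ctx)                          # grow the receptive field
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        # Global average context acts as the final focal level.
        ctx_global = F.gelu(ctx.mean(dim=(2, 3), keepdim=True))
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        # The query is modulated elementwise by the aggregated context.
        return self.proj_out(q * self.h(ctx_all))

block = FocalModulation(dim=64)
y = block(torch.randn(2, 64, 128, 256))  # (batch, channels, mel bins, frames)
```

Because the modulator weights each time-frequency position of the query elementwise, the aggregated context can be read as a saliency-like map over the spectrogram, which is the kind of built-in interpretability signal the paper evaluates.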
Related papers
- Reasoning with the Theory of Mind for Pragmatic Semantic Communication [62.87895431431273]
A pragmatic semantic communication framework is proposed in this paper.
It enables effective goal-oriented information sharing between two intelligent agents.
Numerical evaluations demonstrate the framework's ability to achieve efficient communication with a reduced amount of bits.
arXiv Detail & Related papers (2023-11-30T03:36:19Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization [2.423660247459463]
This paper tackles two major problem settings for interpretability of audio processing networks.
For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user.
We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
arXiv Detail & Related papers (2023-05-11T20:50:51Z)
- Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF [2.423660247459463]
We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision.
We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
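For a rough picture of the mechanism, the sketch below factorizes a magnitude spectrogram with off-the-shelf NMF and reconstructs the portion attributed to a few components. The energy-based relevance here is a stand-in heuristic; the actual method learns component relevance from the classifier's hidden activations.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy magnitude spectrogram: 513 frequency bins x 200 frames (non-negative).
V = np.abs(np.random.randn(513, 200))

# Factorize V ~= W @ H: W holds spectral templates, H their activations in time.
nmf = NMF(n_components=16, init="nndsvda", max_iter=400, random_state=0)
W = nmf.fit_transform(V)           # (513, 16) spectral dictionary
H = nmf.components_                # (16, 200) temporal activations

# Pick components deemed relevant to the decision. Top-3 by energy is a
# placeholder; the paper infers relevance from the network itself.
relevance = W.sum(axis=0) * H.sum(axis=1)
top = np.argsort(relevance)[-3:]
V_interp = W[:, top] @ H[top, :]   # spectrogram of the "listenable" part

# A soft mask like this can filter the original complex STFT before
# inversion, so the interpretation can be played back to the end-user.
mask = V_interp / np.maximum(W @ H, 1e-8)
```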
arXiv Detail & Related papers (2022-02-23T13:00:55Z)
- Interpreting deep urban sound classification using Layer-wise Relevance Propagation [5.177947445379688]
This work focuses on the sensitive application of assisting drivers suffering from hearing loss by constructing a deep neural network for urban sound classification.
We use two different representations of audio signals, i.e. Mel and constant-Q spectrograms, while the decisions made by the deep neural network are explained via layer-wise relevance propagation.
Overall, we present an explainable AI framework for understanding deep urban sound classification.
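For reference, the epsilon rule at the heart of layer-wise relevance propagation is easy to state for a single linear layer; the sketch below is a generic textbook formulation, not the authors' code, and it absorbs bias relevance for simplicity.

```python
import torch

def lrp_epsilon_linear(layer: torch.nn.Linear, a: torch.Tensor,
                       relevance_out: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Epsilon rule: redistribute output relevance onto the inputs in
    proportion to each input's contribution to the pre-activations."""
    z = layer(a)
    # Sign-matched stabilizer avoids division by (near-)zero activations.
    z = z + eps * torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
    s = relevance_out / z            # per-output relevance ratios
    c = s @ layer.weight             # propagate ratios back through the weights
    return a * c                     # input relevance R_i = a_i * c_i

# Example: propagate the relevance of a predicted class through the head.
head = torch.nn.Linear(128, 10)
activations = torch.randn(4, 128)
R_out = torch.nn.functional.one_hot(torch.tensor([3, 1, 0, 7]), 10).float()
R_in = lrp_epsilon_linear(head, activations, R_out)   # (4, 128) relevances
```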
arXiv Detail & Related papers (2021-11-19T14:15:45Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
arXiv Detail & Related papers (2020-12-12T19:46:56Z)
- Contextual Interference Reduction by Selective Fine-Tuning of Neural Networks [1.0152838128195465]
We study how context interferes with a disentangled representation of the foreground target object.
We develop a framework that benefits from both bottom-up and top-down processing paradigms.
arXiv Detail & Related papers (2020-11-21T20:11:12Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network on VoxCeleb data, then fine-tune part of the high-level network layers with clean speech from CRSS-Forensics, as sketched below.
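In PyTorch terms, this kind of partial fine-tuning amounts to freezing the pre-trained low-level layers and optimizing only the top of the network. The architecture below is a placeholder, not the paper's actual CNN.

```python
import torch
from torch import nn

# Hypothetical speaker-embedding backbone; the layer names and sizes are
# illustrative assumptions, not the architecture used in the paper.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),   # low-level feature extractor
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 256),                          # high-level embedding layer
)

# Freeze the low-level layers pre-trained on VoxCeleb ...
for p in model[:4].parameters():
    p.requires_grad = False

# ... and fine-tune only the high-level layers on in-domain clean speech.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```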
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks [29.821666380496637]
HiFi-GAN transforms recorded speech to sound as though it had been recorded in a studio.
It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
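A deep feature matching loss of this kind compares the discriminator's intermediate activations for the clean reference and the enhanced speech; the function below is a generic sketch under that assumption, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between per-layer discriminator activations for the
    clean reference (feats_real) and enhanced speech (feats_fake)."""
    loss = torch.zeros(())
    for f_real, f_fake in zip(feats_real, feats_fake):
        # Real features serve as targets only; detach to block their gradients.
        loss = loss + F.l1_loss(f_fake, f_real.detach())
    return loss / len(feats_real)
```

Here feats_real and feats_fake would be lists of intermediate feature maps returned by a discriminator forward pass, one tensor per layer.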
arXiv Detail & Related papers (2020-06-10T07:24:39Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.