Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF
- URL: http://arxiv.org/abs/2202.11479v1
- Date: Wed, 23 Feb 2022 13:00:55 GMT
- Title: Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF
- Authors: Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc, Gaël Richard
- Abstract summary: We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision.
We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
- Score: 2.423660247459463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles post-hoc interpretability for audio processing networks.
Our goal is to interpret decisions of a network in terms of high-level audio
objects that are also listenable for the end-user. To this end, we propose a
novel interpreter design that incorporates non-negative matrix factorization
(NMF). In particular, a carefully regularized interpreter module is trained to
take hidden layer representations of the targeted network as input and produce
time activations of pre-learnt NMF components as intermediate outputs. Our
methodology allows us to generate intuitive audio-based interpretations that
explicitly enhance parts of the input signal most relevant for a network's
decision. We demonstrate our method's applicability on popular benchmarks,
including a real-world multi-label classification task.
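To make the pipeline concrete, here is a minimal PyTorch sketch of such an interpreter module. All names and shapes (NMFInterpreter, hidden_dim, the fixed dictionary W) are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class NMFInterpreter(nn.Module):
    """Maps hidden-layer features of a frozen classifier to non-negative
    time activations of pre-learnt NMF components (illustrative sketch)."""

    def __init__(self, hidden_dim: int, W: torch.Tensor):
        super().__init__()
        # W: fixed pre-learnt NMF dictionary, shape (n_freq_bins, n_components).
        self.register_buffer("W", W)
        self.to_activations = nn.Sequential(
            nn.Linear(hidden_dim, W.shape[1]),
            nn.ReLU(),  # enforce non-negativity, as NMF requires
        )

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, time, hidden_dim), taken from the target network.
        H = self.to_activations(hidden)  # (batch, time, n_components)
        V_hat = H @ self.W.T             # (batch, time, n_freq_bins) spectrogram estimate
        return H, V_hat
```

Under this reading, a listenable interpretation would come from soft-masking the input spectrogram with a reconstruction built from the class-relevant components, then inverting back to a waveform.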
Related papers
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-Z (Listenable Maps for Audio), bringing listenable explanations to the zero-shot context for the first time.
We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z)
- Listenable Maps for Audio Classifiers [13.596715710792528]
We introduce Listenable Maps for Audio Classifiers (L-MAC), a post-hoc interpretation method that generates faithful and listenable interpretations.
L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio.
We show that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies.
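A minimal sketch of that masking idea, under assumed names and shapes (this is not the released L-MAC code):

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Decodes features of a frozen pretrained classifier into a binary-ish
    mask over the input spectrogram; shapes and layers are assumptions."""

    def __init__(self, feat_dim: int, n_freq: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_freq)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) from the classifier's encoder.
        soft = torch.sigmoid(self.proj(feats))  # soft mask in [0, 1]
        hard = (soft > 0.5).float()             # binarize for listening
        return hard + soft - soft.detach()      # straight-through gradients

# Training idea: feed mask * spectrogram back into the frozen classifier and
# reward masks that keep its prediction close to the original (faithfulness).
```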
arXiv Detail & Related papers (2024-03-19T18:32:48Z)
- Focal Modulation Networks for Interpretable Sound Classification [14.360545133618267]
This paper addresses the problem of interpretability by design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets).
We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset.
Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability.
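For context, focal modulation replaces attention with a gated, convolution-based context that multiplicatively modulates a query. A single-level simplification (the actual FocalNets aggregate several hierarchical context levels; names here are illustrative):

```python
import torch
import torch.nn as nn

class FocalModulation1L(nn.Module):
    """Single-level focal modulation: a query is modulated elementwise by a
    gated, depthwise-convolved context. Simplified from multi-level FocalNets."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.context = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.gate = nn.Linear(dim, 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim), e.g. a time-frequency feature map.
        q = self.q(x)
        ctx = self.context(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        g = torch.sigmoid(self.gate(x))  # where to inject context
        return self.proj(q * (ctx * g))  # modulation instead of attention
```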
arXiv Detail & Related papers (2024-02-05T06:20:52Z)
- An Explainable Proxy Model for Multilabel Audio Segmentation [1.7611027732647493]
We propose an explainable multilabel segmentation model that solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD) simultaneously.
Experiments conducted on two datasets show performance similar to that of the pre-trained black-box model while exhibiting strong explainability features.
arXiv Detail & Related papers (2024-01-16T10:41:33Z)
- Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization [2.423660247459463]
This paper tackles two major problem settings for interpretability of audio processing networks.
For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user.
We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
arXiv Detail & Related papers (2023-05-11T20:50:51Z)
- Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to augment current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
- Robust Semantic Communications with Masked VQ-VAE Enabled Codebook [56.63571713657059]
We propose a framework for robust end-to-end semantic communication systems to combat semantic noise.
To combat the semantic noise, adversarial training with weight perturbation is developed to incorporate samples with semantic noise into the training dataset.
We develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features.
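The FIM internals are not described here; as a generic, assumed illustration of suppressing features by learned importance (not the paper's architecture):

```python
import torch
import torch.nn as nn

class FeatureImportanceGate(nn.Module):
    """Generic learned gate that damps low-importance features; an assumed
    stand-in for the paper's FIM, for illustration only."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        importance = torch.sigmoid(self.score(x))  # per-feature weight in [0, 1]
        return x * importance  # suppress noise-related, task-unrelated features
```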
arXiv Detail & Related papers (2022-06-08T16:58:47Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design consisting of a convolutional neural network and a TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
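A toy rendering of the dual-path idea, with assumed layer sizes (the published MFA is deeper and adds multi-scale frequency-channel attention on top):

```python
import torch
import torch.nn as nn

class DualPath(nn.Module):
    """Toy dual-path front end: a 2D CNN branch plus a dilated Conv1d
    (TDNN-style) branch, fused along channels. Layer sizes are assumptions."""

    def __init__(self, n_mels: int = 80, out_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # pool out the frequency axis
        )
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, dilation=2, padding=4), nn.ReLU(),
        )
        self.fuse = nn.Conv1d(32 + 128, out_dim, kernel_size=1)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mels, time) log-mel spectrogram.
        c = self.cnn(spec.unsqueeze(1)).squeeze(2)  # (batch, 32, time)
        t = self.tdnn(spec)                         # (batch, 128, time)
        return self.fuse(torch.cat([c, t], dim=1))  # (batch, out_dim, time)
```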
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
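The track-relatedness objective is a standard triplet setup; a minimal sketch with the embedding networks and sampling omitted (function name assumed):

```python
import torch
import torch.nn.functional as F

def triplet_relatedness_loss(anchor: torch.Tensor,
                             positive: torch.Tensor,
                             negative: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Pulls cross-modally related audio embeddings together and pushes
    unrelated ones at least `margin` apart. Embeddings: (batch, dim)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```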
arXiv Detail & Related papers (2020-03-27T07:37:15Z)