Tackling Interpretability in Audio Classification Networks with
Non-negative Matrix Factorization
- URL: http://arxiv.org/abs/2305.07132v1
- Date: Thu, 11 May 2023 20:50:51 GMT
- Title: Tackling Interpretability in Audio Classification Networks with
Non-negative Matrix Factorization
- Authors: Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Gaël Richard,
Florence d'Alché-Buc
- Abstract summary: This paper tackles two major problem settings for interpretability of audio processing networks.
For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user.
We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
- Score: 2.423660247459463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles two major problem settings for interpretability of audio
processing networks, post-hoc and by-design interpretation. For post-hoc
interpretation, we aim to interpret decisions of a network in terms of
high-level audio objects that are also listenable for the end-user. This is
extended to present an inherently interpretable model with high performance. To
this end, we propose a novel interpreter design that incorporates non-negative
matrix factorization (NMF). In particular, an interpreter is trained to
generate a regularized intermediate embedding from hidden layers of a target
network, learnt as time-activations of a pre-learnt NMF dictionary. Our
methodology allows us to generate intuitive audio-based interpretations that
explicitly enhance parts of the input signal most relevant for a network's
decision. We demonstrate our method's applicability on a variety of
classification tasks, including multi-label data for real-world audio and
music.
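The idea of expressing a signal as time-activations of a pre-learnt NMF dictionary can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: in the paper a trained interpreter network predicts the activations from hidden layers of the classifier, whereas here the activations are solved for directly with standard multiplicative updates on a fixed dictionary. The function names, the toy data, and the `relevant` component indices are all hypothetical.

```python
import numpy as np


def nmf_activations(X, W, n_iter=200, eps=1e-10, seed=0):
    """Estimate non-negative activations H with X ~= W @ H, keeping W fixed.

    X: (n_freq, n_frames) non-negative magnitude spectrogram.
    W: (n_freq, n_components) pre-learnt non-negative dictionary.
    Uses multiplicative updates for the Euclidean objective.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    WtX = W.T @ X
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update keeps every entry of H non-negative.
        H *= WtX / (WtW @ H + eps)
    return H


def filter_components(X, W, H, relevant, eps=1e-10):
    """Soft-mask X so that only the chosen components are emphasized.

    `relevant` is a list of component indices (hypothetical here; in the
    paper, relevance comes from the interpreter and the classifier).
    """
    full = W @ H + eps
    part = W[:, relevant] @ H[relevant, :]
    return X * (part / full)


# Toy example: synthetic dictionary and spectrogram (assumed data).
rng = np.random.default_rng(1)
W = rng.random((16, 4))
X = W @ rng.random((4, 20))

H = nmf_activations(X, W)
enhanced = filter_components(X, W, H, relevant=[0, 2])
```

The masked spectrogram `enhanced` has the same shape as `X`, so in a real pipeline it could be combined with the input phase and inverted to audio, which is what makes this kind of interpretation listenable.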
Related papers
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context).
We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z)
- Listenable Maps for Audio Classifiers [13.596715710792528]
We introduce Listenable Maps for Audio Classifiers (L-MAC), a post-hoc interpretation method that generates faithful and listenable interpretations.
L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio.
We show that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies.
arXiv Detail & Related papers (2024-03-19T18:32:48Z)
- Focal Modulation Networks for Interpretable Sound Classification [14.360545133618267]
This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets).
We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset.
Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability.
arXiv Detail & Related papers (2024-02-05T06:20:52Z)
- Guiding the PLMs with Semantic Anchors as Intermediate Supervision: Towards Interpretable Semantic Parsing [57.11806632758607]
We propose to incorporate the current pretrained language models with a hierarchical decoder network.
By taking the first-principle structures as the semantic anchors, we propose two novel intermediate supervision tasks.
We conduct intensive experiments on several semantic parsing benchmarks and demonstrate that our approach can consistently outperform the baselines.
arXiv Detail & Related papers (2022-10-04T07:27:29Z)
- Robust Semantic Communications with Masked VQ-VAE Enabled Codebook [56.63571713657059]
We propose a framework for robust end-to-end semantic communication systems to combat semantic noise.
To combat semantic noise, adversarial training with weight is developed to incorporate samples with semantic noise into the training dataset.
We develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features.
arXiv Detail & Related papers (2022-06-08T16:58:47Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF [2.423660247459463]
We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision.
We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
arXiv Detail & Related papers (2022-02-23T13:00:55Z)
- Sparse Mixture of Local Experts for Efficient Speech Enhancement [19.645016575334786]
We investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
By splitting up the speech denoising task into non-overlapping subproblems, we are able to improve denoising performance while also reducing computational complexity.
Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network.
arXiv Detail & Related papers (2020-05-16T23:23:22Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.