An Explainable Proxy Model for Multilabel Audio Segmentation
- URL: http://arxiv.org/abs/2401.08268v2
- Date: Wed, 17 Jan 2024 13:28:04 GMT
- Title: An Explainable Proxy Model for Multilabel Audio Segmentation
- Authors: Théo Mariotte and Antonio Almudévar and Marie Tahon and Alfonso Ortega
- Abstract summary: We propose an explainable multilabel segmentation model that solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD) simultaneously.
Experiments conducted on two datasets show performance comparable to that of the pre-trained black-box model while providing strong explainability.
- Score: 1.7611027732647493
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio signal segmentation is a key task for automatic audio indexing. It
consists of detecting the boundaries of class-homogeneous segments in the
signal. In many applications, explainable AI is a vital process for
transparency of decision-making with machine learning. In this paper, we
propose an explainable multilabel segmentation model that solves speech
activity detection (SAD), music detection (MD), noise detection (ND), and
overlapped speech detection (OSD) simultaneously. This proxy uses
non-negative matrix factorization (NMF) to map the embedding used for
segmentation to the frequency domain. Experiments conducted on two datasets
show performance comparable to that of the pre-trained black-box model while
exhibiting strong explainability.
Specifically, the frequency bins used for the decision can be easily identified
at both the segment level (local explanations) and global level (class
prototypes).
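As a rough sketch of this pipeline (not the authors' implementation), the snippet below factorizes a spectrogram with scikit-learn's NMF and reads per-frame class decisions off a linear proxy head applied to the NMF activations; the spectrogram, the head weights, and all shapes are stand-ins.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
S = np.abs(rng.standard_normal((64, 500)))  # stand-in spectrogram (freq x time)

# Factorize S ~= W @ H: W holds frequency templates, H their per-frame activations.
nmf = NMF(n_components=16, init="nndsvda", max_iter=400, random_state=0)
W = nmf.fit_transform(S)   # (freq, components)
H = nmf.components_        # (components, time)

# Linear proxy head mapping NMF activations to the four classes (SAD, MD, ND, OSD).
# In the paper this head is trained to mimic the black-box model; here it is random.
proxy_head = rng.standard_normal((4, 16))
logits = proxy_head @ H    # (classes, time)

# Local explanation at frame t: which components drove the SAD decision,
# and which frequency templates (columns of W) they correspond to.
t = 100
contrib = proxy_head * H[:, t]              # (classes, components)
top = np.argsort(contrib[0])[::-1][:3]
print("top SAD components at frame", t, ":", top)
print("their frequency templates have shape", W[:, top].shape)
```

Local explanations then amount to ranking the components of H at a given frame by their contribution to a class logit; the corresponding columns of W are the frequency templates shown to the user, and averaging them per class yields the global prototypes.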
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
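A toy illustration of the discrete-unit idea, assuming k-means quantization of frames from some self-supervised speech encoder (the random features below stand in for real encoder outputs):

```python
# Toy illustration of discrete speech units (DSU): quantize continuous encoder
# frames with k-means and use the cluster indices as tokens.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.standard_normal((1000, 256))  # stand-in for (time, dim) encoder output

kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(frames)
units = kmeans.predict(frames)             # one discrete unit per frame

# Collapse consecutive repeats before treating the units as input tokens.
dedup = units[np.insert(np.diff(units) != 0, 0, True)]
print(len(units), "frames ->", len(dedup), "deduplicated units")
```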
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process [102.18226145874007]
We propose a model-agnostic solution called SegRefiner to enhance the quality of object masks produced by different segmentation models.
SegRefiner takes coarse masks as inputs and refines them using a discrete diffusion process.
It consistently improves both the segmentation metrics and boundary metrics across different types of coarse masks.
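The sketch below is a heavily simplified toy of iterative mask refinement in the spirit of a discrete diffusion process, not SegRefiner's actual algorithm; `predict_probs` is a hypothetical stand-in for the trained refinement network.

```python
import numpy as np

def predict_probs(image, mask):
    # Placeholder "network": a box filter over the mask as fake confidence;
    # a real refiner would condition on the image content.
    pad = np.pad(mask.astype(float), 1, mode="edge")
    return (pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] +
            pad[1:-1, 2:] + pad[1:-1, 1:-1]) / 5.0

def refine(image, coarse_mask, steps=5, threshold=0.8):
    mask = coarse_mask.copy()
    for _ in range(steps):
        p = predict_probs(image, mask)
        confident = (p > threshold) | (p < 1 - threshold)
        mask = np.where(confident, p > 0.5, mask)  # commit only confident pixels
    return mask

rng = np.random.default_rng(0)
coarse = rng.random((32, 32)) > 0.5  # stand-in coarse mask
print(refine(None, coarse).mean())
```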
arXiv Detail & Related papers (2023-12-19T18:53:47Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
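A minimal sketch of that front-end under assumed shapes; the layer sizes, and k-means as a stand-in for the hidden unit clustering step, are illustrative rather than the paper's model.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Windowing is implicit in the strided 1-D convolutions.
conv_stack = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(64, 64, kernel_size=8, stride=4), nn.GELU(),
    nn.Conv1d(64, 256, kernel_size=4, stride=2), nn.GELU(),
)

audio = torch.randn(1, 1, 16000)            # 1 s of 16 kHz audio (batch, chan, time)
frames = conv_stack(audio).transpose(1, 2)  # (batch, n_frames, 256)

# Categorize frame embeddings into a small set of phoneme-like units.
units = KMeans(n_clusters=50, n_init=4, random_state=0).fit_predict(
    frames.squeeze(0).detach().numpy())
print(frames.shape, units[:20])
```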
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization [2.423660247459463]
This paper tackles two major problem settings for interpretability of audio processing networks.
For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user.
We propose a novel interpreter design that incorporates non-negative matrix factorization (NMF).
arXiv Detail & Related papers (2023-05-11T20:50:51Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
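A hedged sketch of the objective this implies, treating embeddings that share a machine ID as positives; this is a generic supervised-contrastive formulation, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def machine_id_contrastive_loss(emb, machine_ids, temperature=0.1):
    emb = F.normalize(emb, dim=1)              # (batch, dim)
    sim = emb @ emb.t() / temperature          # pairwise cosine similarities
    eye = torch.eye(emb.size(0), dtype=torch.bool)
    pos = (machine_ids.unsqueeze(0) == machine_ids.unsqueeze(1)) & ~eye
    log_prob = F.log_softmax(sim.masked_fill(eye, float("-inf")), dim=1)
    # Average negative log-likelihood over all positive (same-machine) pairs.
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)

emb = torch.randn(8, 128, requires_grad=True)  # stand-in audio embeddings
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # machine IDs
loss = machine_id_contrastive_loss(emb, ids)
loss.backward()
print(loss.item())
```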
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE.
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
- Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks [36.23526776206862]
We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision.
We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code.
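Once contiguous frames are pushed to share a code, segment boundaries can be read directly off the code sequence; a minimal illustration on a synthetic code sequence:

```python
import numpy as np

codes = np.array([5, 5, 5, 12, 12, 3, 3, 3, 3, 7])  # synthetic per-frame VQ codes

# A boundary falls wherever the code changes between adjacent frames.
change = np.flatnonzero(np.diff(codes)) + 1
segments = np.split(codes, change)
print("boundaries at frames:", change.tolist())
print("segments (code, length):", [(int(s[0]), len(s)) for s in segments])
```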
arXiv Detail & Related papers (2020-12-14T14:17:33Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker at each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
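A schematic of the input/output contract this describes; the layers and dimensions below are invented for illustration and are not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyTSVAD(nn.Module):
    """Frames plus one embedding per target speaker in, per-speaker activity out."""
    def __init__(self, feat_dim=40, spk_dim=100, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, feats, spk_embs):
        # feats: (batch, time, feat_dim); spk_embs: (batch, n_spk, spk_dim)
        h, _ = self.encoder(feats)                     # (batch, time, hidden)
        b, t, _ = h.shape
        n = spk_embs.size(1)
        h = h.unsqueeze(1).expand(b, n, t, -1)         # share frames across speakers
        s = spk_embs.unsqueeze(2).expand(b, n, t, -1)  # share embedding across time
        logits = self.head(torch.cat([h, s], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)                   # (batch, n_spk, time)

probs = TinyTSVAD()(torch.randn(2, 50, 40), torch.randn(2, 4, 100))
print(probs.shape)  # torch.Size([2, 4, 50])
```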
arXiv Detail & Related papers (2020-05-14T21:24:56Z)