Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
- URL: http://arxiv.org/abs/2007.13033v1
- Date: Sun, 26 Jul 2020 00:04:17 GMT
- Title: Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery
- Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak
- Abstract summary: Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels.
We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments.
We achieve this via a self-expressing autoencoder framework.
- Score: 37.36402108843216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised spoken term discovery consists of two tasks: finding the
acoustic segment boundaries and labeling acoustically similar segments with the
same labels. We perform segmentation based on the assumption that the frame
feature vectors are more similar within a segment than across the segments.
Therefore, for strong segmentation performance, it is crucial that the features
represent the phonetic properties of a frame more than other factors of
variability. We achieve this via a self-expressing autoencoder framework. It
consists of a single encoder and two decoders with shared weights. The encoder
projects the input features into a latent representation. One of the decoders
tries to reconstruct the input from these latent representations and the other
from the self-expressed version of them. We use the obtained features to
segment and cluster the speech data. We evaluate the performance of the
proposed method in the Zero Resource 2020 challenge unit discovery task. The
proposed system consistently outperforms the baseline, demonstrating the
usefulness of the method in learning representations.
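The abstract fixes the overall shape of the model (one encoder, two weight-sharing decoders, one fed the latent and the other its self-expressed version) but does not pin down the exact self-expression operator. Below is a minimal PyTorch sketch under that reading; the softmax-similarity weighting in `self_express`, the layer sizes, and the summed MSE losses are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a self-expressing autoencoder (not the authors' code).
# Assumptions: MLP encoder/decoder over per-frame features, and a
# softmax-similarity self-expression step; sizes and losses are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfExpressingAutoencoder(nn.Module):
    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # One decoder module applied twice realizes the two
        # decoders with shared weights described in the abstract.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def self_express(self, z: torch.Tensor) -> torch.Tensor:
        # z: (T, latent_dim) frame latents for one utterance. Each frame is
        # rewritten as a convex combination of the OTHER frames' latents;
        # masking the diagonal forbids trivially copying the frame itself.
        T = z.size(0)
        sim = z @ z.t()                                   # (T, T) similarities
        mask = torch.eye(T, dtype=torch.bool, device=z.device)
        attn = F.softmax(sim.masked_fill(mask, float("-inf")), dim=-1)
        return attn @ z                                   # (T, latent_dim)

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)             # project input frames to latents
        x_hat = self.decoder(z)         # reconstruction from the latents
        x_hat_se = self.decoder(self.self_express(z))  # ...from self-expressed latents
        return x_hat, x_hat_se, z

def sea_loss(x, x_hat, x_hat_se):
    # Summing the two reconstruction errors encourages latents that both
    # preserve the input and are expressible by acoustically similar frames.
    return F.mse_loss(x_hat, x) + F.mse_loss(x_hat_se, x)
```
Given features learned this way, the stated segmentation assumption (frames are more similar within a segment than across segments) suggests a simple boundary detector; the cosine measure and threshold below are likewise illustrative, not the paper's algorithm.
```python
def boundaries_from_latents(z: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Hypothesize a boundary wherever adjacent frames are dissimilar, per the
    # assumption that frames are more similar within a segment than across
    # segments. Cosine similarity and the 0.5 threshold are illustrative.
    sim = F.cosine_similarity(z[:-1], z[1:], dim=-1)        # (T-1,)
    return (sim < threshold).nonzero(as_tuple=True)[0] + 1  # boundary frame indices
```
Pooling the latents within each resulting segment and clustering the pooled vectors (e.g., with k-means) would then complete the labeling step; that step is omitted here.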
Related papers
- DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut [62.63481844384229]
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks.
In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method.
Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks.
arXiv Detail & Related papers (2024-06-05T01:32:31Z)
- Revisiting speech segmentation and lexicon learning with better features [29.268728666438495]
We revisit a self-supervised method that segments unlabelled speech into word-like segments.
We start from the two-stage duration-penalised dynamic programming method.
In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT.
After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features.
arXiv Detail & Related papers (2024-01-31T15:06:34Z)
- An Explainable Proxy Model for Multilabel Audio Segmentation [1.7611027732647493]
We propose an explainable multilabel segmentation model that simultaneously solves speech activity detection (SAD), music detection (MD), noise detection (ND), and overlapped speech detection (OSD).
Experiments conducted on two datasets show performance similar to the pre-trained black-box model while offering strong explainability.
arXiv Detail & Related papers (2024-01-16T10:41:33Z)
- Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation [54.23510028456082]
We propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation.
The framework comprises a triple-view encoder and a dual-frequency decoder.
arXiv Detail & Related papers (2023-09-22T01:02:21Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence given only its mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit their complementarity for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via noise-contrastive estimation (NCE).
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
To further improve the architecture, we introduce a weighting function that re-balances classes, increasing the networks' attention to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
- Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification [37.33388614967888]
A hierarchical attention network is proposed to solve a weakly labelled speaker identification problem.
The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker-related information locally and globally.
arXiv Detail & Related papers (2020-05-15T22:57:53Z)