Towards unsupervised phone and word segmentation using self-supervised
vector-quantized neural networks
- URL: http://arxiv.org/abs/2012.07551v1
- Date: Mon, 14 Dec 2020 14:17:33 GMT
- Title: Towards unsupervised phone and word segmentation using self-supervised
vector-quantized neural networks
- Authors: Herman Kamper, Benjamin van Niekerk
- Abstract summary: We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision.
We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code.
- Score: 36.23526776206862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate segmenting and clustering speech into low-bitrate phone-like
sequences without supervision. We specifically constrain pretrained
self-supervised vector-quantized (VQ) neural networks so that blocks of
contiguous feature vectors are assigned to the same code, thereby giving a
variable-rate segmentation of the speech into discrete units. Two segmentation
methods are considered. In the first, features are greedily merged until a
prespecified number of segments are reached. The second uses dynamic
programming to optimize a squared error with a penalty term to encourage fewer
but longer segments. We show that these VQ segmentation methods can be used
without alteration across a wide range of tasks: unsupervised phone
segmentation, ABX phone discrimination, same-different word discrimination, and
as inputs to a symbolic word segmentation algorithm. The penalized method
generally performs best. While results are only comparable to the
state-of-the-art in some cases, in all tasks a reasonable competing approach is
outperformed at a substantially lower bitrate.
Related papers
- An Explainable Proxy Model for Multiabel Audio Segmentation [1.7611027732647493]
We propose an explainable multilabel segmentation model that solves speech activity (SAD), music (MD), noise (ND) and overlapped speech detection (OSD) simultaneously.
Experiments conducted on two datasets show similar performances as the pre-trained black box model while showing strong explainability features.
arXiv Detail & Related papers (2024-01-16T10:41:33Z) - Unsupervised Universal Image Segmentation [59.0383635597103]
We propose an Unsupervised Universal model (U2Seg) adept at performing various image segmentation tasks.
U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models.
We then self-train the model on these pseudo semantic labels, yielding substantial performance gains.
arXiv Detail & Related papers (2023-12-28T18:59:04Z) - Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z) - Word Segmentation on Discovered Phone Units with Dynamic Programming and
Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly.
This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units.
I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
arXiv Detail & Related papers (2022-02-24T07:02:56Z) - SOLO: A Simple Framework for Instance Segmentation [84.00519148562606]
"instance categories" assigns categories to each pixel within an instance according to the instance's location.
"SOLO" is a simple, direct, and fast framework for instance segmentation with strong performance.
Our approach achieves state-of-the-art results for instance segmentation in terms of both speed and accuracy.
arXiv Detail & Related papers (2021-06-30T09:56:54Z) - Segmental Contrastive Predictive Coding for Unsupervised Word
Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level e.g. at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE.
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z) - Exposing Semantic Segmentation Failures via Maximum Discrepancy
Competition [102.75463782627791]
We take steps toward answering the question by exposing failures of existing semantic segmentation methods in the open visual world.
Inspired by previous research on model falsification, we start from an arbitrarily large image set, and automatically sample a small image set by MAximizing the Discrepancy (MAD) between two segmentation methods.
The selected images have the greatest potential in falsifying either (or both) of the two methods.
A segmentation method, whose failures are more difficult to be exposed in the MAD competition, is considered better.
arXiv Detail & Related papers (2021-02-27T16:06:25Z) - Unifying Instance and Panoptic Segmentation with Dynamic Rank-1
Convolutions [109.2706837177222]
DR1Mask is the first panoptic segmentation framework that exploits a shared feature map for both instance and semantic segmentation.
As a byproduct, DR1Mask is 10% faster and 1 point in mAP more accurate than previous state-of-the-art instance segmentation network BlendMask.
arXiv Detail & Related papers (2020-11-19T12:42:10Z) - Deep Variational Instance Segmentation [7.334808870313923]
State-of-the-art algorithms often employ two separate stages, the first one generating object proposals and the second one recognizing and refining the boundaries.
We propose a novel algorithm that directly utilizes a fully convolutional network (FCN) to predict instance labels.
arXiv Detail & Related papers (2020-07-22T17:57:49Z) - Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.