Related papers: Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

URL: http://arxiv.org/abs/2012.07551v1
Date: Mon, 14 Dec 2020 14:17:33 GMT
Title: Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks
Authors: Herman Kamper, Benjamin van Niekerk
Abstract summary: We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code.
Score: 36.23526776206862
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments are reached. The second uses dynamic programming to optimize a squared error with a penalty term to encourage fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized method generally performs best. While results are only comparable to the state-of-the-art in some cases, in all tasks a reasonable competing approach is outperformed at a substantially lower bitrate.

Related papers

An Explainable Proxy Model for Multiabel Audio Segmentation [1.7611027732647493]
We propose an explainable multilabel segmentation model that solves speech activity (SAD), music (MD), noise (ND) and overlapped speech detection (OSD) simultaneously. Experiments conducted on two datasets show similar performances as the pre-trained black box model while showing strong explainability features.
arXiv Detail & Related papers (2024-01-16T10:41:33Z)
Unsupervised Universal Image Segmentation [59.0383635597103]
We propose an Unsupervised Universal model (U2Seg) adept at performing various image segmentation tasks. U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models. We then self-train the model on these pseudo semantic labels, yielding substantial performance gains.
arXiv Detail & Related papers (2023-12-28T18:59:04Z)
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages. Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points. By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly. This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units. I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
arXiv Detail & Related papers (2022-02-24T07:02:56Z)
SOLO: A Simple Framework for Instance Segmentation [84.00519148562606]
"instance categories" assigns categories to each pixel within an instance according to the instance's location. "SOLO" is a simple, direct, and fast framework for instance segmentation with strong performance. Our approach achieves state-of-the-art results for instance segmentation in terms of both speed and accuracy.
arXiv Detail & Related papers (2021-06-30T09:56:54Z)
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level e.g. at the phoneme level. A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE. We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
Exposing Semantic Segmentation Failures via Maximum Discrepancy Competition [102.75463782627791]
We take steps toward answering the question by exposing failures of existing semantic segmentation methods in the open visual world. Inspired by previous research on model falsification, we start from an arbitrarily large image set, and automatically sample a small image set by MAximizing the Discrepancy (MAD) between two segmentation methods. The selected images have the greatest potential in falsifying either (or both) of the two methods. A segmentation method, whose failures are more difficult to be exposed in the MAD competition, is considered better.
arXiv Detail & Related papers (2021-02-27T16:06:25Z)
Unifying Instance and Panoptic Segmentation with Dynamic Rank-1 Convolutions [109.2706837177222]
DR1Mask is the first panoptic segmentation framework that exploits a shared feature map for both instance and semantic segmentation. As a byproduct, DR1Mask is 10% faster and 1 point in mAP more accurate than previous state-of-the-art instance segmentation network BlendMask.
arXiv Detail & Related papers (2020-11-19T12:42:10Z)
Deep Variational Instance Segmentation [7.334808870313923]
State-of-the-art algorithms often employ two separate stages, the first one generating object proposals and the second one recognizing and refining the boundaries. We propose a novel algorithm that directly utilizes a fully convolutional network (FCN) to predict instance labels.
arXiv Detail & Related papers (2020-07-22T17:57:49Z)
Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition. We describe an efficient approach for end-to-end whole-word segmental models. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.