Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming
- URL: http://arxiv.org/abs/2409.14486v1
- Date: Sun, 22 Sep 2024 15:16:43 GMT
- Title: Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming
- Authors: Simon Malan, Benjamin van Niekerk, Herman Kamper
- Abstract summary: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon.
Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon.
For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints.
- Score: 22.044042563954378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.
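As a rough illustration of the strategy described in the abstract (not the authors' implementation), the sketch below places boundaries where adjacent self-supervised frame features are dissimilar and then clusters mean-pooled segment embeddings into a small lexicon; the cosine-dissimilarity threshold, the pooling, and the cluster count are illustrative assumptions.

```python
# Minimal sketch of "boundary detection from adjacent-feature dissimilarity, then clustering".
# The threshold, mean-pooled segment embeddings, and number of clusters are assumptions
# for illustration; they are not the settings used in the paper.
import numpy as np
from sklearn.cluster import KMeans

def detect_boundaries(features, threshold=0.4):
    """features: (T, D) self-supervised frame features. Returns segment start indices."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    dissim = 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)  # cosine dissimilarity of adjacent frames
    return [0] + [t + 1 for t, d in enumerate(dissim) if d > threshold]

def build_lexicon(features, starts, n_types=50):
    """Mean-pool each predicted segment and cluster the segment embeddings with K-means."""
    edges = starts + [len(features)]
    segment_embs = np.stack([features[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
    labels = KMeans(n_clusters=min(n_types, len(segment_embs)), n_init=10).fit_predict(segment_embs)
    return [((a, b), int(c)) for (a, b), c in zip(zip(edges[:-1], edges[1:]), labels)]

# Toy usage with random frames standing in for real self-supervised features.
frames = np.random.randn(200, 64).astype(np.float32)
lexicon = build_lexicon(frames, detect_boundaries(frames))
```

In real use, HuBERT-style frame features would replace the random input and the threshold would be tuned per language.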
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, multilingual adaptive gradient-based subword tokenization, to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation [90.35249276717038]
We propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation.
Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction.
A new decoder is designed to interpret extracted semantic features for final prediction.
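As a generic illustration of the frozen-backbone-plus-new-decoder pattern described above (not the actual WeCLIP code, and without loading a real CLIP encoder), the sketch below freezes a stand-in feature extractor and trains only a small decoder head.

```python
# Generic sketch of "frozen backbone for features, trainable decoder for prediction".
# The backbone here is a stand-in convolution, not CLIP; shapes are illustrative.
import torch
import torch.nn as nn

class FrozenBackboneSegmenter(nn.Module):
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False            # the backbone is never updated
        self.decoder = nn.Conv2d(feat_dim, n_classes, kernel_size=1)  # only this part trains

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)      # dense semantic features from the frozen model
        return self.decoder(feats)             # per-pixel class logits

backbone = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # stand-in for a frozen CLIP encoder
model = FrozenBackboneSegmenter(backbone, feat_dim=64, n_classes=21)
logits = model(torch.randn(2, 3, 64, 64))                # -> (2, 21, 64, 64)
```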
arXiv Detail & Related papers (2024-06-17T03:49:47Z) - OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation [69.37484603556307]
Unsupervised semantic segmentation (USS) involves segmenting images without relying on predefined labels.
We introduce a novel approach called Optimally Matched Hierarchy (OMH) to simultaneously address the above issues.
Our OMH yields better unsupervised segmentation performance compared to existing USS methods.
arXiv Detail & Related papers (2024-03-11T09:46:41Z) - Revisiting speech segmentation and lexicon learning with better features [29.268728666438495]
We revisit a self-supervised method that segments unlabelled speech into word-like segments.
We start from the two-stage duration-penalised dynamic programming method.
In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT.
After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features.
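A minimal sketch of the feature side of this pipeline, assuming torchaudio's pretrained HuBERT bundle as a stand-in for the features used in the paper; the layer index and the example segment are arbitrary choices for illustration.

```python
# Hedged sketch: extract HuBERT frame features and mean-pool a hypothesised segment
# into an acoustic word embedding. The layer index and segment are illustrative only.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)          # one second of dummy audio at 16 kHz
with torch.no_grad():
    layer_feats, _ = model.extract_features(waveform)  # list of (1, T, 768) tensors, one per layer
frames = layer_feats[6].squeeze(0)                     # pick an intermediate transformer layer

# An acoustic word embedding for a hypothesised segment is the mean of its frames.
segment_embedding = frames[10:25].mean(dim=0)          # frames 10..24 -> (768,)
```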
arXiv Detail & Related papers (2024-01-31T15:06:34Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image
Segmentation [102.25240608024063]
Referring image segmentation segments an image based on a language expression.
We develop an algorithm that shifts the emphasis from localization to segmentation.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - SASFormer: Transformers for Sparsely Annotated Semantic Segmentation [44.758672633271956]
We propose a simple yet effective sparse annotated semantic segmentation framework based on segformer, dubbed SASFormer.
Specifically, the framework first generates hierarchical patch attention maps, which are then multiplied by the network predictions to produce correlated regions separated by valid labels.
arXiv Detail & Related papers (2022-12-05T04:33:12Z) - Smart Speech Segmentation using Acousto-Linguistic Features with
look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation.
On average, our models improve the segmentation F0.5 score by 9.8% over the baseline.
For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z) - DP-Parse: Finding Word Boundaries from Raw Speech with an Instance
Lexicon [18.05179713472479]
We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens.
On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages.
Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic representations, as assessed by a new spoken word embedding benchmark.
arXiv Detail & Related papers (2022-06-22T19:15:57Z) - Word Segmentation on Discovered Phone Units with Dynamic Programming and
Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly.
This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units.
I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
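A toy sketch of a duration-penalised dynamic programming segmenter in the spirit described here; the quadratic segment cost and the flat per-segment penalty are placeholders for the self-supervised scoring network used in the actual DPDP work.

```python
# Toy duration-penalised dynamic programming (DPDP) segmenter. The segment cost
# (spread around the segment mean) and the flat per-segment penalty are placeholders
# for the self-supervised scoring network described in the paper.
import numpy as np

def segment_cost(features, start, end):
    """Cost of treating frames [start, end) as one segment."""
    seg = features[start:end]
    return float(np.sum((seg - seg.mean(axis=0)) ** 2))

def dpdp_segment(features, dur_penalty=1.0, max_len=20):
    """Boundaries minimising total segment cost plus a penalty per segment (favouring fewer, longer segments)."""
    T = len(features)
    best = np.full(T + 1, np.inf)
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for end in range(1, T + 1):
        for start in range(max(0, end - max_len), end):
            cost = best[start] + segment_cost(features, start, end) + dur_penalty
            if cost < best[end]:
                best[end], back[end] = cost, start
    boundaries, t = [], T                # walk the backpointers from the final frame
    while t > 0:
        boundaries.append(int(t))
        t = back[t]
    return sorted(boundaries)            # segment end indices, e.g. [23, 51, ..., T]

print(dpdp_segment(np.random.randn(100, 16)))
```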
arXiv Detail & Related papers (2022-02-24T07:02:56Z) - Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings [81.09026586111811]
We propose an approach to semantic segmentation that achieves state-of-the-art supervised performance when applied in a zero-shot setting.
This is achieved by replacing each class label with a vector-valued embedding of a short paragraph that describes the class.
The resulting merged semantic segmentation dataset of over 2 million images enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets.
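As a hedged illustration of the label-as-text-embedding idea, the sketch below shows one possible way such class-description embeddings could be used at inference: each class is represented by the embedding of its descriptive paragraph and pixels are assigned to the most similar class embedding. The encoders are random stand-ins, not the paper's models.

```python
# Assign each pixel to the class whose paragraph embedding is most similar.
# Both embedding spaces below are random stand-ins for illustration only.
import numpy as np

def assign_pixels(pixel_embs, class_embs):
    """pixel_embs: (H, W, D); class_embs: (C, D). Returns (H, W) class indices."""
    p = pixel_embs / (np.linalg.norm(pixel_embs, axis=-1, keepdims=True) + 1e-8)
    c = class_embs / (np.linalg.norm(class_embs, axis=-1, keepdims=True) + 1e-8)
    return np.argmax(p @ c.T, axis=-1)        # cosine similarity via normalised dot product

class_embs = np.random.randn(2, 128)          # stand-in for sentence embeddings of class descriptions
pixel_embs = np.random.randn(32, 32, 128)     # stand-in for per-pixel features
labels = assign_pixels(pixel_embs, class_embs)  # -> (32, 32) array of class indices
```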
arXiv Detail & Related papers (2022-02-04T07:19:09Z) - Temporally-Weighted Hierarchical Clustering for Unsupervised Action
Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
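As a rough sketch of the temporally-weighted hierarchical clustering idea, the snippet below scales feature distances by temporal distance before agglomerative clustering; this particular weighting is an illustrative choice, not the paper's exact formulation.

```python
# Hedged sketch of temporally-weighted hierarchical clustering for action segmentation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def temporally_weighted_clusters(frames, n_actions, alpha=1.0):
    """frames: (T, D) frame features. Returns a (T,) array of cluster labels in 1..n_actions."""
    T = len(frames)
    feat_dist = np.linalg.norm(frames[:, None, :] - frames[None, :, :], axis=-1)
    time_dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :]) / T
    dist = feat_dist * (1.0 + alpha * time_dist)   # frames far apart in time are harder to merge
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_actions, criterion="maxclust")

labels = temporally_weighted_clusters(np.random.randn(50, 8), n_actions=4)
```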