Segmental Contrastive Predictive Coding for Unsupervised Word
Segmentation
- URL: http://arxiv.org/abs/2106.02170v1
- Date: Thu, 3 Jun 2021 23:12:05 GMT
- Title: Segmental Contrastive Predictive Coding for Unsupervised Word
Segmentation
- Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano
Moro-Velazquez, Najim Dehak
- Abstract summary: We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g. at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE.
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
- Score: 33.35220574193796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic detection of phoneme or word-like units is one of the core
objectives in zero-resource speech processing. Recent attempts employ
self-supervised training methods, such as contrastive predictive coding (CPC),
where the next frame is predicted given past context. However, CPC only looks
at the audio signal's frame-level structure. We overcome this limitation with a
segmental contrastive predictive coding (SCPC) framework that can model the
signal structure at a higher level e.g. at the phoneme level. In this
framework, a convolutional neural network learns frame-level representation
from the raw waveform via noise-contrastive estimation (NCE). A differentiable
boundary detector finds variable-length segments, which are then used to
optimize a segment encoder via NCE to learn segment representations. The
differentiable boundary detector allows us to train frame-level and
segment-level encoders jointly. Typically, phoneme and word segmentation are
treated as separate tasks. We unify them and experimentally show that our
single model outperforms existing phoneme and word segmentation methods on
TIMIT and Buckeye datasets. We analyze the impact of the boundary threshold and
of the point in training at which the segmental loss is introduced.
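The pipeline described in the abstract (frame-level NCE, boundary detection, segment pooling) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the cosine-similarity score, the peak rule, and the default `threshold` value are all hypothetical, and the hard thresholding used here is a non-differentiable stand-in for the differentiable boundary detector mentioned in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def info_nce(pred, positive, negatives):
    """NCE-style loss: negative log-softmax of the positive's score
    among the positive and all negative candidates (assumed scoring)."""
    scores = [cosine(pred, positive)] + [cosine(pred, n) for n in negatives]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]

def detect_boundaries(frames, threshold=0.05):
    """Hypothetical boundary rule: mark a boundary where the
    dissimilarity between adjacent frames has a local peak whose
    prominence exceeds `threshold`."""
    if len(frames) < 3:
        return []
    d = [1.0 - cosine(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
    lo, hi = min(d), max(d)
    d = [(x - lo) / (hi - lo + 1e-8) for x in d]  # min-max normalise
    boundaries = []
    for t in range(1, len(d) - 1):
        peak = min(max(d[t] - d[t - 1], 0.0), max(d[t] - d[t + 1], 0.0))
        if peak > threshold:
            boundaries.append(t + 1)  # a new segment starts at frame t + 1
    return boundaries

def mean_pool_segments(frames, boundaries):
    """Average the frames inside each segment to get one vector per segment."""
    edges = [0] + boundaries + [len(frames)]
    segments = []
    for start, end in zip(edges, edges[1:]):
        chunk = frames[start:end]
        segments.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return segments

# Toy usage: two frames near [1, 0] followed by two frames near [0, 1].
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
boundaries = detect_boundaries(frames)
segments = mean_pool_segments(frames, boundaries)
```

In this toy setting the same `info_nce` function could be applied to the pooled segment vectors, which mirrors the idea of a segment-level NCE objective; raising `threshold` yields fewer, longer segments.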
Related papers
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model that predicts the boundaries of the segmental structures in speech signals and training a phoneme prediction model, which takes the speech features segmented by the segmentation model as input and predicts a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Revisiting speech segmentation and lexicon learning with better features [29.268728666438495]
We revisit a self-supervised method that segments unlabelled speech into word-like segments.
We start from the two-stage duration-penalised dynamic programming method.
In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT.
After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features.
arXiv Detail & Related papers (2024-01-31T15:06:34Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages the existing pretrained vision-language model (VL) to train semantic segmentation models.
ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression.
We propose a parallel position-kernel-segmentation pipeline to better isolate and then interact with the localization and segmentation steps.
Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods on fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly.
This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units.
I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
arXiv Detail & Related papers (2022-02-24T07:02:56Z)
- Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks [36.23526776206862]
We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision.
We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code.
arXiv Detail & Related papers (2020-12-14T14:17:33Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
- Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)