Related papers: Revisiting speech segmentation and lexicon learning with better features

URL: http://arxiv.org/abs/2401.17902v1
Date: Wed, 31 Jan 2024 15:06:34 GMT
Title: Revisiting speech segmentation and lexicon learning with better features
Authors: Herman Kamper, Benjamin van Niekerk
Abstract summary: We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT. After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features.
Score: 29.268728666438495
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT. After word segmentation in the second stage, we get an acoustic word embedding for each segment by averaging HuBERT features. These embeddings are clustered using K-means to get a lexicon. The result is good full-coverage segmentation with a lexicon that achieves state-of-the-art performance on the ZeroSpeech benchmarks.
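The last two steps of the abstract (averaging frame features inside each segment, then clustering the averages with K-means) can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: toy 2-D vectors stand in for HuBERT features, the segment boundaries are given by hand, the function names `mean_pool` and `kmeans` are hypothetical, and K-means is plain Lloyd's algorithm written with the standard library only.

```python
import random

def mean_pool(frames, boundaries):
    """Average the frame-level feature vectors inside each (start, end) segment."""
    embeddings = []
    for start, end in boundaries:
        segment = frames[start:end]
        # zip(*segment) iterates over feature dimensions
        embeddings.append([sum(col) / len(segment) for col in zip(*segment)])
    return embeddings

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy stand-in for frame-level features (assumption: 2-D instead of HuBERT's
# high-dimensional vectors). Four hand-picked segments of varying length.
frames = [
    [0.0, 0.2], [0.2, 0.0],               # segment 0
    [5.0, 5.2], [5.2, 5.0], [5.1, 5.1],   # segment 1
    [0.4, 0.0], [0.0, 0.0],               # segment 2
    [4.8, 5.0], [5.0, 4.8],               # segment 3
]
boundaries = [(0, 2), (2, 5), (5, 7), (7, 9)]

embeddings = mean_pool(frames, boundaries)   # one fixed-size vector per segment
centroids, labels = kmeans(embeddings, k=2)  # cluster index acts as a lexicon entry
```

Each segment ends up with one fixed-dimensional embedding regardless of its length, and segments that share a cluster index are treated as instances of the same lexicon entry. In this toy data, segments 0 and 2 fall into one cluster and segments 1 and 3 into the other.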

Related papers

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between training a segmentation model that predicts the boundaries of the segmental structures in speech signals and training a phoneme prediction model, whose input is the speech features segmented by the segmentation model, to predict a phoneme transcription. We conduct extensive experiments and find that, under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages an existing pretrained vision-language (VL) model to train semantic segmentation models. ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages. Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points. By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead [3.579111205766969]
We present a hybrid approach that leverages both acoustic and language information to improve segmentation. On average, our models improve segmentation-F0.5 score by 9.8% over baseline. For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
arXiv Detail & Related papers (2022-10-26T03:36:31Z)
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task. This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem. We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring [23.822788597966646]
Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly. This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units. I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs.
arXiv Detail & Related papers (2022-02-24T07:02:56Z)
Adaptive Early-Learning Correction for Segmentation from Noisy Annotations [13.962891776039369]
We study the learning dynamics of deep segmentation networks trained on inaccurately-annotated data. We propose a new method for segmentation from noisy annotations with two key elements.
arXiv Detail & Related papers (2021-10-07T18:46:23Z)
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level. A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE. We show that our single model outperforms existing phoneme and word segmentation methods on the TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition. We describe an efficient approach for end-to-end whole-word segmental models. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)
Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components. This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
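The duration-penalised dynamic programming idea that appears in the main paper and in the "Word Segmentation on Discovered Phone Units" entry can be sketched generically. The sketch below is an illustration under stated assumptions, not the scoring used in either paper: the self-supervised scoring network is replaced by a toy cost (squared deviation from the segment mean on a 1-D signal), the penalty `0.1 * (1 - dur)` is an illustrative choice that rewards longer segments, and the names `dp_segment`, `toy_cost`, and `toy_penalty` are hypothetical.

```python
def dp_segment(seq_len, cost, penalty, max_dur):
    """Find the segmentation of frames [0, seq_len) that minimises the sum of
    cost(start, end) + penalty(end - start) over all segments."""
    inf = float("inf")
    best = [0.0] + [inf] * seq_len   # best[t]: minimum total for frames [0, t)
    back = [0] * (seq_len + 1)       # back[t]: start of the last segment ending at t
    for end in range(1, seq_len + 1):
        for start in range(max(0, end - max_dur), end):
            total = best[start] + cost(start, end) + penalty(end - start)
            if total < best[end]:
                best[end], back[end] = total, start
    # Follow the back-pointers from the end to recover the segments
    segments, end = [], seq_len
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    return segments[::-1]

# Toy 1-D "signal" with three homogeneous regions (assumption: stands in for
# frame-level features or discovered unit sequences).
signal = [0.0, 0.1, 0.0, 3.0, 3.1, 2.9, 3.0, 0.1, 0.0]

def toy_cost(start, end):
    """Stand-in segment cost: squared deviation from the segment mean."""
    seg = signal[start:end]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)

def toy_penalty(dur):
    """Illustrative duration penalty: longer segments get a lower penalty."""
    return 0.1 * (1 - dur)

segments = dp_segment(len(signal), toy_cost, toy_penalty, max_dur=5)
print(segments)  # [(0, 3), (3, 7), (7, 9)]
```

Swapping `toy_cost` for a different scoring function changes what kind of unit the same procedure finds, which is the property the DPDP entry above highlights. Without the penalty term, the minimum-cost solution here would be one segment per frame.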

This list is automatically generated from the titles and abstracts of the papers on this site.