XLS-R fine-tuning on noisy word boundaries for unsupervised speech
segmentation into words
- URL: http://arxiv.org/abs/2310.05235v1
- Date: Sun, 8 Oct 2023 17:05:00 GMT
- Title: XLS-R fine-tuning on noisy word boundaries for unsupervised speech
segmentation into words
- Authors: Robin Algayres, Pablo Diego-Simon, Benoit Sagot, Emmanuel Dupoux
- Abstract summary: We fine-tune an XLS-R model to predict word boundaries produced by top-tier speech segmentation systems.
Our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
- Score: 13.783996617841467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the absence of explicit word boundaries in the speech stream, the task
of segmenting spoken sentences into word units without text supervision is
particularly challenging. In this work, we leverage the most recent
self-supervised speech models that have proved to quickly adapt to new tasks
through fine-tuning, even in low resource conditions. Taking inspiration from
semi-supervised learning, we fine-tune an XLS-R model to predict word
boundaries that are themselves produced by top-tier speech segmentation systems: DPDP,
VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer
new word boundary labels that are used in turn for another fine-tuning step.
Our method consistently improves the performance of each system and sets a new
state-of-the-art that is, on average, 130% higher than the previous one, as
measured by the F1 score on correctly discovered word tokens across five corpora
featuring different languages. Finally, our system can segment speech from
languages unseen during fine-tuning in a zero-shot fashion.
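The procedure described in the abstract amounts to a self-training loop: a frame-level boundary classifier built on XLS-R is fine-tuned on noisy pseudo-boundaries from an existing segmentation system, then used to re-label the corpus for the next fine-tuning round. The sketch below is one plausible way to set this up with an XLS-R checkpoint such as facebook/wav2vec2-xls-r-300m; the class and function names, the binary frame-labelling scheme, and the 0.5 decision threshold are illustrative assumptions, not the authors' released implementation.

```python
# Minimal self-training sketch (illustrative, not the paper's code).
# Assumption: the data loader yields 16 kHz waveforms together with binary
# frame-level labels already aligned to XLS-R's ~20 ms output frames,
# with 1 marking a frame that contains a pseudo word boundary.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # XLS-R checkpoints use the wav2vec 2.0 architecture

class BoundaryTagger(nn.Module):
    """XLS-R encoder plus a linear head scoring each frame as boundary / non-boundary."""
    def __init__(self, checkpoint: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        # waveforms: (batch, samples) of 16 kHz audio
        hidden = self.encoder(waveforms).last_hidden_state  # (batch, frames, dim)
        return self.head(hidden).squeeze(-1)                # (batch, frames) boundary logits

def fine_tune_round(model, loader, optimizer, device="cuda"):
    """One fine-tuning pass on (waveform, noisy frame-level boundary label) pairs."""
    loss_fn = nn.BCEWithLogitsLoss()
    model.train()
    for waveforms, frame_labels in loader:
        waveforms = waveforms.to(device)
        frame_labels = frame_labels.to(device).float()
        loss = loss_fn(model(waveforms), frame_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def relabel(model, waveforms, threshold=0.5):
    """Infer new pseudo-boundaries with the fine-tuned model for the next round."""
    model.eval()
    probs = torch.sigmoid(model(waveforms))
    return (probs > threshold).long()  # new frame-level boundary labels for re-training
```

Under these assumptions, one would seed the loop with boundaries from DPDP, VG-HuBERT, GradSeg or DP-Parse, then alternate fine_tune_round and relabel until segmentation quality stops improving.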
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon in order to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, fine-tuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [49.06057768982775]
We show that a syllabic organization emerges in learning sentence-level representation of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
arXiv Detail & Related papers (2023-10-16T20:05:36Z)
- Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition.
arXiv Detail & Related papers (2023-08-28T21:16:08Z)
- DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon [18.05179713472479]
We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens.
On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages.
Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic representations, as assessed by a new spoken word embedding benchmark.
arXiv Detail & Related papers (2022-06-22T19:15:57Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation [31.87235700253597]
This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data.
We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units.
arXiv Detail & Related papers (2021-04-02T11:43:07Z)