DP-Parse: Finding Word Boundaries from Raw Speech with an Instance
Lexicon
- URL: http://arxiv.org/abs/2206.11332v1
- Date: Wed, 22 Jun 2022 19:15:57 GMT
- Title: DP-Parse: Finding Word Boundaries from Raw Speech with an Instance
Lexicon
- Authors: Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon,
Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux
- Abstract summary: We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens.
On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages.
Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations, as assessed by a new spoken word embedding benchmark.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finding word boundaries in continuous speech is challenging as there is
little or no equivalent of a 'space' delimiter between words. Popular Bayesian
non-parametric models for text segmentation use a Dirichlet process to jointly
segment sentences and build a lexicon of word types. We introduce DP-Parse,
which uses similar principles but only relies on an instance lexicon of word
tokens, avoiding the clustering errors that arise with a lexicon of word types.
On the Zero Resource Speech Benchmark 2017, our model sets a new speech
segmentation state-of-the-art in 5 languages. The algorithm monotonically
improves with better input representations, achieving yet higher scores when
fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can
be pipelined to a language model and learn semantic and syntactic
representations as assessed by a new spoken word embedding benchmark.
Related papers
- Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming [22.044042563954378]
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon.
Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon.
For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints.
arXiv Detail & Related papers (2024-09-22T15:16:43Z) - An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - XLS-R fine-tuning on noisy word boundaries for unsupervised speech
segmentation into words [13.783996617841467]
We fine-tune an XLS-R model to predict word boundaries produced by top-tier speech segmentation systems.
Our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
arXiv Detail & Related papers (2023-10-08T17:05:00Z) - Generative Spoken Language Model based on continuous word-sized audio
tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z) - Cascading and Direct Approaches to Unsupervised Constituency Parsing on
Spoken Sentences [67.37544997614646]
We present the first study on unsupervised spoken constituency parsing.
The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees.
We show that accurate segmentation alone may be sufficient to parse spoken sentences accurately.
arXiv Detail & Related papers (2023-03-15T17:57:22Z) - Between words and characters: A Brief History of Open-Vocabulary
Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is not, and likely never will be, a silver-bullet solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z) - Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage
Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging.
Our experiments show that our BERT-based model SpanSegTag achieved competitive performances on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made on inputs tokenized by the standard and probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
arXiv Detail & Related papers (2021-03-15T16:07:42Z) - Augmenting Part-of-speech Tagging with Syntactic Information for
Vietnamese and Chinese [0.32228025627337864]
We implement the idea to improve word segmentation and part of speech tagging of the Vietnamese language by employing a simplified constituency.
Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based constituency.
This model can be augmented with predicted word boundary and part-of-speech tags by other tools.
arXiv Detail & Related papers (2021-02-24T08:57:02Z)