Are word boundaries useful for unsupervised language learning?
- URL: http://arxiv.org/abs/2210.02956v1
- Date: Thu, 6 Oct 2022 14:49:42 GMT
- Title: Are word boundaries useful for unsupervised language learning?
- Authors: Tu Anh Nguyen, Maureen de Seyssel, Robin Algayres, Patricia Roze, Ewan
Dunbar, Emmanuel Dupoux
- Abstract summary: Words provide at least two kinds of relevant information: boundary information and meaningful units.
We show that word boundary information may be absent or unreliable in the case of speech input.
We show that gold boundaries can be replaced by automatically found ones obtained with an unsupervised segmentation algorithm.
- Score: 13.049946284598935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word or word-fragment based Language Models (LM) are typically preferred over
character-based ones in many downstream applications. This may not be
surprising as words seem more linguistically relevant units than characters.
Words provide at least two kinds of relevant information: boundary information
and meaningful units. However, word boundary information may be absent or
unreliable in the case of speech input (word boundaries are not marked
explicitly in the speech stream). Here, we systematically compare LSTMs as a
function of the input unit (character, phoneme, word, word part), with or
without gold boundary information. We probe linguistic knowledge in the
networks at the lexical, syntactic and semantic levels using three
speech-adapted, black-box, psycholinguistically-inspired NLP benchmarks (pWUGGY,
pBLIMP, pSIMI). We find that the absence of boundaries costs between 2% and
28% in relative performance depending on the task. We show that gold
boundaries can be replaced by automatically found ones obtained with an
unsupervised segmentation algorithm, and that even modest segmentation
performance gives a gain in performance on two of the three tasks compared to
basic character/phone based models without boundary information.
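As a concrete illustration, here is a minimal Python sketch of a pWUGGY-style "spot-the-word" probe: the LM should assign higher probability to a real word than to a matched nonword. The `score_fn` callable and the `<b>` boundary token are placeholders, not the paper's code; a real run would plug in a trained character/phone LSTM's summed log-probabilities.

```python
# Minimal sketch (not the paper's code) of a pWUGGY-style "spot-the-word"
# probe. `score_fn` stands in for a trained character/phone LM, and the
# "<b>" token stands in for the gold or automatically found boundaries.
from typing import Callable, Iterable, List, Tuple

def with_boundaries(phones: List[str], boundary: str = "<b>") -> List[str]:
    """Delimit a phone sequence with boundary tokens."""
    return [boundary] + phones + [boundary]

def spot_the_word_accuracy(
    pairs: Iterable[Tuple[List[str], List[str]]],
    score_fn: Callable[[List[str]], float],
    use_boundaries: bool = True,
) -> float:
    """Fraction of (word, nonword) pairs where the LM prefers the word."""
    correct, total = 0, 0
    for word, nonword in pairs:
        if use_boundaries:
            word, nonword = with_boundaries(word), with_boundaries(nonword)
        correct += score_fn(word) > score_fn(nonword)
        total += 1
    return correct / max(total, 1)

# Toy usage with a dummy scorer; a real run would sum an LSTM LM's
# per-token log-probabilities over the sequence.
pairs = [(list("brick"), list("blick"))]
print(spot_the_word_accuracy(pairs, score_fn=lambda s: s.count("r")))
```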
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object specified by a query sentence from a 3D point cloud.
We propose LESS, a novel label-efficient, single-stage Referring 3D pipeline that is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU while using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words [13.783996617841467]
We fine-tune an XLS-R model to predict word boundaries produced by top-tier speech segmentation systems.
Our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
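This entry describes a simple recipe: train a frame-level boundary classifier on top of a pretrained multilingual speech encoder, using noisy pseudo-labels from existing segmenters. A minimal sketch under stated assumptions (the facebook/wav2vec2-xls-r-300m checkpoint, a linear head, and the zero pseudo-labels are illustrative choices, not the authors' released code):

```python
# Hedged sketch of frame-level boundary fine-tuning; checkpoint, head, and
# pseudo-labels below are illustrative assumptions, not the authors' code.
import torch
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
head = torch.nn.Linear(encoder.config.hidden_size, 1)  # one logit per frame

def boundary_logits(waveform: torch.Tensor) -> torch.Tensor:
    """(batch, samples) of 16 kHz audio -> (batch, frames) boundary logits."""
    frames = encoder(waveform).last_hidden_state       # (B, T, H), ~20 ms frames
    return head(frames).squeeze(-1)                    # (B, T)

# One training step against noisy per-frame labels produced by an
# off-the-shelf unsupervised segmenter (zeros here as a stand-in):
wave = torch.randn(2, 16000)                           # two 1-second clips
logits = boundary_logits(wave)
loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits, torch.zeros_like(logits))
loss.backward()
```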
arXiv Detail & Related papers (2023-10-08T17:05:00Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon [18.05179713472479]
We introduce DP-Parse, which uses similar principles to existing lexicon-based segmentation algorithms but relies only on an instance lexicon of word tokens.
On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages.
Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic representations, as assessed by a new spoken word embedding benchmark.
arXiv Detail & Related papers (2022-06-22T19:15:57Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is no single silver-bullet solution for all applications, and likely never will be.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention [19.520840812910357]
Sindhi word segmentation is a challenging task due to space omission and insertion issues.
Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features.
We propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task.
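For context, word segmentation as sequence labeling is commonly framed with per-character BIES tags (Begin/Inside/End/Single); a generic decoding sketch, not the SGNWS model itself:

```python
# Word segmentation cast as per-character sequence labeling with BIES tags.
# The tags here are hard-coded for illustration; SGNWS would predict them
# with a neural tagger over subword representations.
def decode_bies(chars, tags):
    """Rebuild words from per-character B/I/E/S tags."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):           # a word ends at this character
            words.append("".join(current))
            current = []
    if current:                          # tolerate a dangling partial word
        words.append("".join(current))
    return words

print(decode_bies(list("thecat"), ["B", "I", "E", "B", "I", "E"]))
# -> ['the', 'cat']
```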
arXiv Detail & Related papers (2020-12-30T08:31:31Z)
- Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech [24.187382590960254]
Children do not build their lexicon by segmenting spoken input into phonemes and then building up words from them.
This suggests that the ideal way of learning a language is by starting from full semantic units.
We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient.
arXiv Detail & Related papers (2020-06-15T13:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.