On the Difficulty of Segmenting Words with Attention
- URL: http://arxiv.org/abs/2109.10107v1
- Date: Tue, 21 Sep 2021 11:37:08 GMT
- Title: On the Difficulty of Segmenting Words with Attention
- Authors: Ramon Sanabria, Hao Tang, Sharon Goldwater
- Abstract summary: We show, however, that even on monolingual data this approach is brittle.
In experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word segmentation, the problem of finding word boundaries in speech, is of
interest for a range of tasks. Previous papers have suggested that for
sequence-to-sequence models trained on tasks such as speech translation or
speech recognition, attention can be used to locate and segment the words. We
show, however, that even on monolingual data this approach is brittle. In our
experiments with different input types, data sizes, and segmentation
algorithms, only models trained to predict phones from words succeed in the
task. Models trained in the opposite direction, to predict words from either
phones or speech (i.e., the direction needed to generalize to new data), yield
much worse results,
suggesting that attention-based segmentation is only useful in limited
scenarios.
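A minimal sketch of the attention-based segmentation heuristic the paper examines: given an attention matrix from a trained sequence-to-sequence model (one row per output token, one column per input frame), a common approach places a word boundary midway between the attention peaks of consecutive output tokens. The matrix and function name below are illustrative assumptions, not taken from the paper:

```python
def attention_boundaries(attn):
    """Place a boundary midway between the attention peaks of consecutive words.

    attn: list of rows, one per output word; each row holds the attention
    weights over the input frames. Returns frame indices of the boundaries.
    """
    peaks = [row.index(max(row)) for row in attn]            # peak frame per word
    return [(a + b) // 2 for a, b in zip(peaks, peaks[1:])]  # midpoint between peaks

# Three output words attending over five input frames (toy example).
attn = [
    [0.7, 0.2, 0.1, 0.0, 0.0],  # word 1 peaks at frame 0
    [0.0, 0.1, 0.8, 0.1, 0.0],  # word 2 peaks at frame 2
    [0.0, 0.0, 0.0, 0.2, 0.8],  # word 3 peaks at frame 4
]
print(attention_boundaries(attn))  # -> [1, 3]
```

The paper's finding is that peaks like these align reliably with word boundaries only when the model predicts phones from words, not in the reverse direction needed to segment new data.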
Related papers
- Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images
We consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each.
In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other.
We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.
arXiv Detail & Related papers (2023-12-20T05:17:06Z) - Segment and Caption Anything
We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation.
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
arXiv Detail & Related papers (2023-12-01T19:00:17Z) - Towards Open Vocabulary Learning: A Survey
Deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection.
Recently, open vocabulary settings have been proposed, driven by the rapid progress of vision-language pre-training.
This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2023-06-28T02:33:06Z) - Self-Supervised Speech Representation Learning: A Review
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Word Order Does Matter (And Shuffled Language Models Know It)
Recent studies have shown that language models pretrained and/or fine-tuned on randomly permuted sentences exhibit competitive performance on GLUE.
We investigate what position embeddings learned from shuffled text encode, showing that these models retain information pertaining to the original, naturalistic word order.
arXiv Detail & Related papers (2022-03-21T14:10:15Z) - On Guiding Visual Attention with Language Specification
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z) - Direct speech-to-speech translation with discrete units
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Narrative Incoherence Detection
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether any semantic discrepancies exist in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z) - BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available.
In developing a methodology to handle single documents, we face two major challenges.
First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms.
Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments.
arXiv Detail & Related papers (2020-08-05T16:34:33Z) - Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings
We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z) - Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech
Children do not build their lexicon by segmenting spoken input into phonemes and then building up words from them.
This suggests that the ideal way of learning a language is by starting from full semantic units.
We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient.
arXiv Detail & Related papers (2020-06-15T13:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.