Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech
- URL: http://arxiv.org/abs/2006.08387v2
- Date: Tue, 20 Oct 2020 13:15:59 GMT
- Title: Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech
- Authors: William N. Havard, Jean-Pierre Chevrot, Laurent Besacier
- Abstract summary: Children do not build their lexicon by segmenting spoken input into phonemes and then building up words from them.
This suggests that the ideal way of learning a language is by starting from full semantic units.
We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient.
- Score: 24.187382590960254
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The language acquisition literature shows that children do not build their
lexicon by segmenting the spoken input into phonemes and then building up words
from them, but rather adopt a top-down approach and start by segmenting
word-like units and then break them down into smaller units. This suggests that
the ideal way of learning a language is by starting from full semantic units.
In this paper, we investigate if this is also the case for a neural model of
Visually Grounded Speech trained on a speech-image retrieval task. We evaluate
how well such a network is able to learn a reliable speech-to-image mapping
when provided with phone, syllable, or word boundary information. We present a
simple way to introduce such information into an RNN-based model and
investigate which type of boundary is the most efficient. We also explore at
which level of the network's architecture such information should be introduced
so as to maximise its performance. Finally, we show that using multiple
boundary types at once in a hierarchical structure, in which low-level segments
are used to recompose high-level segments, is beneficial and yields better
results than using low-level or high-level segments in isolation.
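To make the boundary-injection idea concrete, here is a minimal PyTorch sketch, not the authors' released code: low-level RNN states are gathered at given phone boundaries and recomposed by a second RNN into word-level states. The layer sizes, the gather-at-boundary mechanism, and the mean-pooled utterance embedding are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalBoundaryEncoder(nn.Module):
    """Hypothetical two-level encoder: frames -> phones -> words."""
    def __init__(self, feat_dim=39, hidden=256):
        super().__init__()
        self.phone_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.word_rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, frames, phone_ends, word_ends):
        # frames: (1, T, feat_dim); phone_ends: segment-final frame indices;
        # word_ends: segment-final indices into the resulting phone sequence.
        low, _ = self.phone_rnn(frames)       # frame-level states (1, T, hidden)
        phones = low[:, phone_ends, :]        # one state per phone segment
        high, _ = self.word_rnn(phones)       # recompose phones into words
        words = high[:, word_ends, :]         # one state per word segment
        return words.mean(dim=1)              # utterance-level embedding

enc = HierarchicalBoundaryEncoder()
x = torch.randn(1, 100, 39)                   # 100 frames of acoustic features
utt = enc(x, phone_ends=[9, 24, 41, 66, 99], word_ends=[1, 4])
print(utt.shape)                              # torch.Size([1, 256])
```

An utterance embedding of this form could then be matched against an image embedding for the speech-image retrieval objective the abstract describes.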
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Segment and Caption Anything [126.20201216616137]
We propose a method to efficiently equip the Segment Anything Model with the ability to generate regional captions.
By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation.
We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice.
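As a rough illustration of what a lightweight query-based feature mixer could look like (the names, dimensions, and use of standard multi-head attention below are assumptions, not the paper's exact design): learnable queries cross-attend to region features and are projected into the language model's embedding space.

```python
import torch
import torch.nn as nn

class QueryFeatureMixer(nn.Module):
    """Assumed design: learnable queries attend over one region's features."""
    def __init__(self, region_dim=256, lm_dim=768, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, region_dim))
        self.attn = nn.MultiheadAttention(region_dim, num_heads=8,
                                          batch_first=True)
        self.proj = nn.Linear(region_dim, lm_dim)  # into LM embedding space

    def forward(self, region_feats):
        # region_feats: (B, N, region_dim) features for one masked region
        q = self.queries.unsqueeze(0).expand(region_feats.size(0), -1, -1)
        mixed, _ = self.attn(q, region_feats, region_feats)
        return self.proj(mixed)  # (B, n_queries, lm_dim) prefix tokens

mixer = QueryFeatureMixer()
prefix = mixer(torch.randn(2, 50, 256))
print(prefix.shape)  # torch.Size([2, 8, 768])
```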
arXiv Detail & Related papers (2023-12-01T19:00:17Z) - Neural approaches to spoken content embedding [1.3706331473063877]
We contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs).
We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition.
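A hedged sketch of an RNN-based acoustic word embedding in this spirit, with assumed dimensions: a bidirectional GRU encodes a spoken word segment, and the final states of both directions are concatenated and length-normalized into a fixed-dimensional vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RnnAWE(nn.Module):
    """Illustrative AWE encoder; sizes and pooling are assumptions."""
    def __init__(self, feat_dim=39, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)

    def forward(self, segment):
        # segment: (B, T, feat_dim) frames of one spoken word
        _, h = self.rnn(segment)                 # (num_layers*2, B, hidden)
        emb = torch.cat([h[-2], h[-1]], dim=-1)  # last layer, both directions
        return F.normalize(emb, dim=-1)          # unit-length embedding

awe = RnnAWE()
e = awe(torch.randn(4, 80, 39))
print(e.shape)  # torch.Size([4, 512])
```

Embeddings of this kind can be compared by cosine distance for query-by-example search, one of the downstream tasks mentioned above.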
arXiv Detail & Related papers (2023-08-28T21:16:08Z) - Representation Learning With Hidden Unit Clustering For Low Resource
Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations.
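A minimal sketch of this recipe under stated assumptions (layer sizes and the cluster inventory are illustrative, and the clustering step that produces frame-level cluster IDs is assumed to run offline, e.g. with k-means): raw audio is windowed by 1-D convolutions and each resulting frame is classified into a phoneme-like unit.

```python
import torch
import torch.nn as nn

class ConvHUC(nn.Module):
    """Assumed HUC-style model: conv frontend + per-frame cluster classifier."""
    def __init__(self, n_units=50):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(128, 256, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(256, 256, kernel_size=4, stride=2), nn.GELU(),
        )
        self.head = nn.Linear(256, n_units)  # predict a cluster ID per frame

    def forward(self, wav):
        z = self.encoder(wav.unsqueeze(1))   # (B, 256, T')
        return self.head(z.transpose(1, 2))  # (B, T', n_units) logits

model = ConvHUC()
logits = model(torch.randn(2, 16000))             # one second of 16 kHz audio
targets = torch.randint(0, 50, logits.shape[:2])  # offline cluster IDs per frame
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 50), targets.reshape(-1))
```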
arXiv Detail & Related papers (2023-07-14T13:02:10Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
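A minimal sketch of an LSTM language model over discrete sub-word units, assuming phoneme or syllable IDs are already available; the inventory size and dimensions are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    """Next-unit prediction over a small discrete inventory."""
    def __init__(self, n_units=64, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(dim, n_units)

    def forward(self, units):
        h, _ = self.lstm(self.embed(units))
        return self.out(h)  # next-unit logits at every position

lm = UnitLM()
seq = torch.randint(0, 64, (8, 30))      # batches of phoneme/syllable IDs
logits = lm(seq[:, :-1])                 # predict units 1..29 from prefixes
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 64), seq[:, 1:].reshape(-1))
```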
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - On the Difficulty of Segmenting Words with Attention [32.97060026226872]
We show, however, that even on monolingual data this approach is brittle.
In experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task.
arXiv Detail & Related papers (2021-09-21T11:37:08Z) - Segmental Contrastive Predictive Coding for Unsupervised Word
Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g. at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE.
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
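The segment-level contrastive idea can be sketched as follows, assuming segment embeddings are already extracted (the differentiable boundary detector is omitted); this is an InfoNCE-style stand-in for the paper's NCE objective, not its exact loss.

```python
import torch
import torch.nn.functional as F

def segment_infonce(segs, temperature=0.1):
    # segs: (N, D) consecutive segment embeddings from one utterance.
    # Each segment should score its true successor higher than the
    # other segments in the sequence (the in-batch negatives).
    anchors, targets = segs[:-1], segs[1:]        # predict the next segment
    logits = anchors @ targets.t() / temperature  # (N-1, N-1) similarities
    labels = torch.arange(logits.size(0))         # true successor = diagonal
    return F.cross_entropy(logits, labels)

loss = segment_infonce(torch.randn(12, 256))
print(loss.item())
```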
arXiv Detail & Related papers (2021-06-03T23:12:05Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
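One way such a compositional output layer can work, sketched with assumed details (the paper's composition function may differ): each candidate word's output embedding is built on the fly from its character sequence, so no vocabulary-sized parameter matrix is ever stored.

```python
import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    """Word logits from character-composed embeddings; sizes are assumptions."""
    def __init__(self, n_chars=128, dim=256):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, dim, padding_idx=0)
        self.compose = nn.GRU(dim, dim, batch_first=True)

    def word_embeddings(self, char_ids):
        # char_ids: (V, L) character IDs spelling each candidate word
        _, h = self.compose(self.char_embed(char_ids))
        return h[-1]                          # (V, dim) composed embeddings

    def forward(self, hidden, char_ids):
        E = self.word_embeddings(char_ids)    # built on the fly, any vocabulary
        return hidden @ E.t()                 # (B, V) word logits

out = CompositionalOutput()
logits = out(torch.randn(4, 256), torch.randint(1, 128, (1000, 12)))
print(logits.shape)  # torch.Size([4, 1000])
```

Because the output matrix is recomputed from spellings, the same parameters can score a new vocabulary at adaptation time, which matches the vocabulary-independence claim above.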
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.