Modelling word learning and recognition using visually grounded speech
- URL: http://arxiv.org/abs/2203.06937v1
- Date: Mon, 14 Mar 2022 08:59:37 GMT
- Title: Modelling word learning and recognition using visually grounded speech
- Authors: Danny Merkx, Sebastiaan Scholten, Stefan L. Frank, Mirjam Ernestus and Odette Scharenborg
- Abstract summary: Computational models of speech recognition often assume that the set of target words is already given.
This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision.
Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input.
- Score: 18.136170489933082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Computational models of speech recognition often assume that the
set of target words is already given. This implies that these models do not
learn to recognise speech from scratch without prior knowledge and explicit
supervision. Visually grounded speech models learn to recognise speech without
prior knowledge by exploiting statistical dependencies between spoken and
visual input. While it has previously been shown that visually grounded speech
models learn to recognise the presence of words in the input, we explicitly
investigate such a model as a model of human speech recognition.
Methods: We investigate the time-course of word recognition as simulated by
the model using a gating paradigm to test whether its recognition is affected
by well-known word-competition effects in human speech processing. We
furthermore investigate whether vector quantisation, a technique for discrete
representation learning, aids the model in the discovery and recognition of
words.
Results/Conclusion: Our experiments show that the model is able to recognise
nouns in isolation and even learns to properly differentiate between plural and
singular nouns. We also find that recognition is influenced by word competition
from the word-initial cohort and neighbourhood density, mirroring word
competition effects in human speech comprehension. Lastly, we find no evidence
that vector quantisation is helpful in discovering and recognising words. Our
gating experiments even show that the vector quantised model requires more of
the input sequence for correct recognition.
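For readers unfamiliar with the gating paradigm, the sketch below illustrates how such a simulation could be set up: a spoken word is truncated at increasingly long gates, each truncated signal is embedded, and recognition is scored by whether the embedding retrieves an image of the word's referent. This is a minimal, hypothetical illustration; `embed_speech`, `embed_image` and the toy data are stand-ins, not the model or code used in the paper.

```python
import numpy as np

# Hypothetical stand-ins for the trained visually grounded speech (VGS) model.
# In the paper these would be the speech and image encoders trained on
# image-caption pairs; here they are placeholders with plausible shapes.
def embed_speech(waveform: np.ndarray) -> np.ndarray:
    """Map a (possibly truncated) waveform to an embedding vector."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2**32))
    return rng.standard_normal(512)

def embed_image(image_id: str) -> np.ndarray:
    """Map an image (here just an id) to an embedding vector."""
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    return rng.standard_normal(512)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def gating_recognition_point(word_audio, target_image, distractor_images,
                             sample_rate=16000, gate_ms=80):
    """Return the shortest gate (in ms) at which the target referent is ranked
    above all distractor images, or None if it never is."""
    gate_len = int(sample_rate * gate_ms / 1000)
    candidates = [target_image] + list(distractor_images)
    image_embs = {img: embed_image(img) for img in candidates}
    for end in range(gate_len, len(word_audio) + 1, gate_len):
        speech_emb = embed_speech(word_audio[:end])
        ranked = sorted(candidates,
                        key=lambda img: cosine(speech_emb, image_embs[img]),
                        reverse=True)
        if ranked[0] == target_image:
            return end / sample_rate * 1000  # recognition point in ms
    return None

# Toy usage: a 500 ms "word" and a handful of candidate referent images.
audio = np.random.randn(8000)  # 0.5 s at 16 kHz
print(gating_recognition_point(audio, "dog_001.jpg", ["cat_002.jpg", "car_003.jpg"]))
```

In these terms, the recognition point indicates how much of the input sequence the model needs before the correct referent wins the competition with its distractors.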
Related papers
- Visually Grounded Speech Models have a Mutual Exclusivity Bias [20.495178526318185]
When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias.
This bias has been studied computationally, but only in models that use discrete word representations as input.
We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio.
arXiv Detail & Related papers (2024-03-20T18:49:59Z)
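As a rough illustration of how a mutual exclusivity preference could be quantified in a visually grounded speech model, the sketch below compares a novel spoken word's similarity to a familiar versus a novel object image and reports how often the novel image wins. The function name, the use of cosine similarity, and the precomputed embeddings are assumptions made for illustration, not the protocol of the paper above.

```python
import numpy as np

def me_bias_score(novel_word_embs, familiar_image_embs, novel_image_embs):
    """Fraction of trials in which a novel spoken word is closer to the novel
    object's image than to the familiar object's image (cosine similarity).
    Values above 0.5 indicate a mutual-exclusivity-like preference."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    wins = [
        cos(w, n) > cos(w, f)
        for w, f, n in zip(novel_word_embs, familiar_image_embs, novel_image_embs)
    ]
    return float(np.mean(wins))

# Toy trial set with random 512-d embeddings standing in for model outputs.
rng = np.random.default_rng(0)
words = rng.standard_normal((100, 512))
familiar = rng.standard_normal((100, 512))
novel = rng.standard_normal((100, 512))
print(me_bias_score(words, familiar, novel))  # ~0.5 for random embeddings
```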
- Identifying and interpreting non-aligned human conceptual representations using language modeling [0.0]
We show that congenital blindness induces conceptual reorganization in both amodal and sensory-related verbal domains.
We find that blind individuals more strongly associate social and cognitive meanings with verbs related to motion.
For some verbs, the representations of blind and sighted individuals are highly similar.
arXiv Detail & Related papers (2024-03-10T13:02:27Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations have proposed using discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
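The de-duplication and subword modelling mentioned above can be illustrated with a small sketch: consecutive repeats of a unit ID are collapsed, and a single BPE-style merge step then replaces the most frequent adjacent pair with one composite symbol. This is a toy illustration of the general idea, not the toolchain used in the study.

```python
from collections import Counter

def deduplicate(units):
    """Collapse consecutive repeats, e.g. [5, 5, 5, 9, 9, 2] -> [5, 9, 2]."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

def merge_most_frequent_pair(sequences):
    """One BPE-style merge step over unit sequences: replace the most frequent
    adjacent pair with a single new symbol (a tuple), shortening the sequences."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    if not pair_counts:
        return sequences
    best = pair_counts.most_common(1)[0][0]
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(best)   # new composite symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

units = [5, 5, 5, 9, 9, 2, 7, 7, 9, 2]
dedup = deduplicate(units)                        # [5, 9, 2, 7, 9, 2]
print(dedup, merge_most_frequent_pair([dedup]))   # the pair (9, 2) gets merged
```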
- The neural dynamics of auditory word recognition and integration [21.582292050622456]
We present a computational model of word recognition which formalizes this perceptual process in Bayesian decision theory.
We fit this model to explain scalp EEG signals recorded as subjects passively listened to a fictional story.
The model reveals distinct neural processing of words depending on whether or not they can be quickly recognized.
arXiv Detail & Related papers (2023-05-22T18:06:32Z)
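A generic way to formalise incremental word recognition in Bayesian terms is sketched below: a posterior over candidate words is updated after each observed phone, so that probability mass shifts away from cohort competitors as the input disambiguates. The toy likelihood (a fixed match/mismatch probability per phone) is an assumption for illustration and is not the model fitted to EEG in the paper above.

```python
import numpy as np

def incremental_posteriors(observed_phones, lexicon, priors,
                           match_prob=0.9, mismatch_prob=0.1):
    """Posterior over candidate words after each observed phone:
    P(w | x_1..t) is proportional to P(w) times the product of P(x_t | w).
    A word whose t-th phone matches the observation contributes match_prob,
    otherwise mismatch_prob (a toy likelihood)."""
    posteriors = []
    log_p = np.log(np.array(priors, dtype=float))
    for t, phone in enumerate(observed_phones):
        for i, word in enumerate(lexicon):
            ok = t < len(word) and word[t] == phone
            log_p[i] += np.log(match_prob if ok else mismatch_prob)
        p = np.exp(log_p - log_p.max())
        posteriors.append(dict(zip(["".join(w) for w in lexicon], p / p.sum())))
    return posteriors

# Toy cohort: "cat", "cap", "can", "dog" with uniform priors.
lexicon = [list("cat"), list("cap"), list("can"), list("dog")]
for step in incremental_posteriors(list("cat"), lexicon, [0.25] * 4):
    print({w: round(p, 3) for w, p in step.items()})
```

Printed step by step, the posterior first spreads over the word-initial cohort ("cat", "cap", "can") and only concentrates on the target once the final phone rules out the competitors.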
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to those of previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
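The random-projection quantizer described above can be sketched as follows: speech features are projected with a frozen random matrix and assigned to the nearest entry of a frozen random codebook, and the resulting indices serve as prediction targets for masked frames. The dimensions, the distance metric, and the class interface below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class RandomProjectionQuantizer:
    """Toy version of a random-projection quantizer: speech features are
    projected with a frozen random matrix and assigned to the nearest entry of
    a frozen random codebook; the codebook index becomes the training label."""
    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=256, seed=0):
        rng = np.random.default_rng(seed)
        self.projection = rng.standard_normal((feat_dim, proj_dim))      # frozen
        self.codebook = rng.standard_normal((codebook_size, proj_dim))   # frozen

    def labels(self, features):
        """features: (T, feat_dim) -> (T,) integer labels."""
        projected = features @ self.projection                 # (T, proj_dim)
        dists = np.linalg.norm(projected[:, None, :] - self.codebook[None, :, :],
                               axis=-1)                        # (T, codebook_size)
        return dists.argmin(axis=1)

# The labels on masked frames would serve as prediction targets for the encoder.
quantizer = RandomProjectionQuantizer()
frames = np.random.randn(50, 80)  # 50 frames of 80-d log-mel features (toy)
print(quantizer.labels(frames)[:10])
```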
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural-based visual lip-reading models.
We observe a strong correlation between theories from cognitive psychology and our modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Hearings and mishearings: decrypting the spoken word [0.0]
We propose a model of the speech perception of individual words in the presence of mishearings.
We show, for instance, that speech perception is easy when the word length is below a threshold, which we identify with a static transition.
We extend this to the dynamics of word recognition, proposing an intuitive approach highlighting the distinction between individual, isolated mishearings and clusters of contiguous mishearings.
arXiv Detail & Related papers (2020-09-01T13:58:51Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- Learning to Recognise Words using Visually Grounded Speech [15.972015648122914]
The model has been trained on pairs of images and spoken captions to create visually grounded embeddings.
We investigate whether such a model can be used to recognise words by embedding isolated words and using them to retrieve images of their visual referents.
arXiv Detail & Related papers (2020-05-31T12:48:37Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images containing in-vocabulary words but generalize poorly to images containing out-of-vocabulary words.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
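The mutual learning strategy mentioned in the entry above can be illustrated in the style of deep mutual learning: each of two models is trained on its task loss plus a KL term that pulls its predictions toward the other model's. The loss form and names below are assumptions for illustration, not the paper's actual training objective.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def kl_divergence(p, q):
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def mutual_learning_losses(logits_a, logits_b, labels, kl_weight=1.0):
    """Per-model losses in a deep-mutual-learning setup: each model is trained
    on its task loss plus a KL term pulling it toward the other model's
    predictions, so the two families learn collaboratively."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = cross_entropy(p_a, labels) + kl_weight * kl_divergence(p_b, p_a)
    loss_b = cross_entropy(p_b, labels) + kl_weight * kl_divergence(p_a, p_b)
    return loss_a, loss_b

# Toy batch: 4 examples, 10 character classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=4)
loss_a, loss_b = mutual_learning_losses(rng.standard_normal((4, 10)),
                                        rng.standard_normal((4, 10)), labels)
print(round(loss_a, 3), round(loss_b, 3))
```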