Learning to Recognise Words using Visually Grounded Speech
- URL: http://arxiv.org/abs/2006.00512v1
- Date: Sun, 31 May 2020 12:48:37 GMT
- Title: Learning to Recognise Words using Visually Grounded Speech
- Authors: Sebastiaan Scholten, Danny Merkx, Odette Scharenborg
- Abstract summary: The model has been trained on pairs of images and spoken captions to create visually grounded embeddings.
We investigate whether such a model can be used to recognise words by embedding isolated words and using them to retrieve images of their visual referents.
- Score: 15.972015648122914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigated word recognition in a Visually Grounded Speech model. The
model has been trained on pairs of images and spoken captions to create
visually grounded embeddings which can be used for speech to image retrieval
and vice versa. We investigate whether such a model can be used to recognise
words by embedding isolated words and using them to retrieve images of their
visual referents. We investigate the time-course of word recognition using a
gating paradigm and perform a statistical analysis to see whether well-known
word competition effects in human speech processing influence word recognition.
Our experiments show that the model is able to recognise words, and the gating
paradigm reveals that words can be recognised from partial input as well and
that recognition is negatively influenced by word competition from the
word-initial cohort.
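The retrieval-based recognition and gating setup can be illustrated with a minimal sketch, assuming a hypothetical speech_encoder standing in for the trained speech branch of the visually grounded model and image embeddings precomputed by its image branch (this is not the authors' released code):

```python
# Minimal sketch of retrieval-based word recognition and the gating paradigm.
# Not the authors' implementation: `speech_encoder` is a hypothetical stand-in
# for the trained speech branch of the visually grounded model, and
# `image_embeddings` are assumed to be precomputed by its image branch.
import numpy as np

def l2_normalise(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def recognise_word(word_waveform, image_embeddings, speech_encoder, top_k=10):
    """Embed an isolated spoken word and rank images by cosine similarity."""
    w = l2_normalise(speech_encoder(word_waveform))     # (d,) word embedding
    imgs = l2_normalise(image_embeddings)               # (n_images, d)
    scores = imgs @ w                                   # cosine similarities
    return np.argsort(-scores)[:top_k]                  # indices of best-matching images

def gated_fragments(word_waveform, n_gates=6):
    """Gating paradigm: word-initial fragments of increasing duration."""
    n = len(word_waveform)
    return [word_waveform[: n * (g + 1) // n_gates] for g in range(n_gates)]
```

Recognition is then scored by whether the retrieved images depict the spoken word's visual referent, and repeating the retrieval for each gated fragment traces the time-course of recognition.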
Related papers
- A model of early word acquisition based on realistic-scale audiovisual naming events [10.047470656294333]
We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input.
We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that learns from statistical regularities in raw speech and pixel-level visual input.
Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants.
arXiv Detail & Related papers (2024-06-07T21:05:59Z)
- Neighboring Words Affect Human Interpretation of Saliency Explanations [65.29015910991261]
Word-level saliency explanations are often used to communicate feature-attribution in text-based models.
Recent studies found that superficial factors such as word length can distort human interpretation of the communicated saliency scores.
We investigate how the marking of a word's neighboring words affects the explainee's perception of the word's importance in the context of a saliency explanation.
arXiv Detail & Related papers (2023-05-04T09:50:25Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with similar model size.
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Modelling word learning and recognition using visually grounded speech [18.136170489933082]
Computational models of speech recognition often assume that the set of target words is already given.
This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision.
Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input.
arXiv Detail & Related papers (2022-03-14T08:59:37Z)
- Evaluating language-biased image classification based on semantic representations [13.508894957080777]
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference.
Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification.
arXiv Detail & Related papers (2022-01-26T15:46:36Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models (a minimal sketch of attention-based localisation appears after this list).
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
arXiv Detail & Related papers (2021-06-16T15:29:11Z)
- Pho(SC)Net: An Approach Towards Zero-shot Word Image Recognition in Historical Documents [2.502407331311937]
Zero-shot learning methods could aptly be used to recognize unseen/out-of-lexicon words in historical document images.
We propose a hybrid representation that considers the character's shape appearance to differentiate between two different words.
Experiments were conducted to examine the effectiveness of an embedding that has properties of both PHOS and PHOC.
arXiv Detail & Related papers (2021-05-31T16:22:33Z)
- Fine-Grained Grounding for Multimodal Speech Recognition [49.01826387664443]
We propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals.
In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features.
arXiv Detail & Related papers (2020-10-05T23:06:24Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
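As noted in the attention-based keyword localisation entry above, the localisation idea can be illustrated with a minimal sketch under simplifying assumptions (per-frame speech features and a keyword query embedding are taken as given; this is not the cited paper's implementation):

```python
# Minimal sketch of attention-based keyword localisation (assumptions only,
# not the cited paper's code): score each speech frame against a keyword
# query, pool with the attention weights for detection, and take the
# highest-attention frame as the predicted location of the keyword.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def localise_keyword(frame_features, keyword_embedding):
    """frame_features: (n_frames, d) speech features; keyword_embedding: (d,)."""
    attention = softmax(frame_features @ keyword_embedding)  # weight per frame
    pooled = attention @ frame_features                      # attention-weighted utterance vector
    detection_score = float(pooled @ keyword_embedding)      # evidence that the keyword occurs
    predicted_frame = int(np.argmax(attention))               # frame where it is most likely located
    return detection_score, predicted_frame
```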
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.