Visually grounded few-shot word acquisition with fewer shots
- URL: http://arxiv.org/abs/2305.15937v1
- Date: Thu, 25 May 2023 11:05:54 GMT
- Title: Visually grounded few-shot word acquisition with fewer shots
- Authors: Leanne Nortje, Benjamin van Niekerk, Herman Kamper
- Abstract summary: We propose a model that acquires new words and their visual depictions from just a few word-image example pairs.
We use a word-to-image attention mechanism to determine word-image similarity.
With this new model, we achieve better performance with fewer shots than any existing approach.
- Score: 26.114011076658237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a visually grounded speech model that acquires new words and their
visual depictions from just a few word-image example pairs. Given a set of test
images and a spoken query, we ask the model which image depicts the query word.
Previous work has simplified this problem by either using an artificial setting
with digit word-image pairs or by using a large number of examples per class.
We propose an approach that can work on natural word-image pairs but with fewer
examples, i.e. fewer shots. Our approach involves using the given word-image
example pairs to mine new unsupervised word-image training pairs from large
collections of unlabelled speech and images. Additionally, we use a
word-to-image attention mechanism to determine word-image similarity. With this
new model, we achieve better performance with fewer shots than any existing
approach.
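To make the word-to-image attention idea concrete, here is a minimal sketch of how a spoken-query embedding could attend over image patch embeddings to produce a word-image similarity score. The function names, tensor shapes, and temperature value are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn.functional as F

def word_to_image_similarity(word_emb: torch.Tensor,
                             patch_embs: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Score how well an image depicts a spoken word.

    word_emb:   (d,)    embedding of the spoken query word.
    patch_embs: (n, d)  embeddings of n image regions/patches.

    The word attends over the image patches; the similarity is the
    attention-weighted patch similarity (a common formulation; the
    paper's exact mechanism may differ).
    """
    word_emb = F.normalize(word_emb, dim=-1)
    patch_embs = F.normalize(patch_embs, dim=-1)
    sims = patch_embs @ word_emb                 # (n,) per-patch similarity
    attn = F.softmax(sims / temperature, dim=0)  # focus on relevant patches
    return (attn * sims).sum()                   # scalar word-image score

def rank_images(word_emb, images_patch_embs):
    """At test time, rank the candidate images for a spoken query and
    return them from best to worst match."""
    scores = torch.stack([word_to_image_similarity(word_emb, p)
                          for p in images_patch_embs])
    return torch.argsort(scores, descending=True)
```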
Related papers
- Diversified in-domain synthesis with efficient fine-tuning for few-shot
classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
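A rough sketch of the general recipe this summary suggests: use a text-to-image model to synthesize diverse in-domain images per class and mix them with the real few-shot examples before fine-tuning a classifier. The Stable Diffusion checkpoint, class names, and prompt templates below are placeholder assumptions, not details from the paper.

```python
import torch
from diffusers import StableDiffusionPipeline  # assumed text-to-image backbone

# Hypothetical class names for a few-shot task; not from the paper.
class_names = ["golden retriever", "tabby cat", "red fox"]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # GPU assumed for illustration

synthetic_data = []
for name in class_names:
    # Diversify prompts to broaden the synthetic distribution per class.
    prompts = [f"a photo of a {name}",
               f"a close-up photo of a {name} outdoors",
               f"a {name} in a natural setting"]
    for prompt in prompts:
        image = pipe(prompt).images[0]
        synthetic_data.append((image, name))

# synthetic_data would then be mixed with the real few-shot examples and a
# classifier fine-tuned (e.g. lightweight adaptation of a pretrained
# backbone) on the combined set.
```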
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
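As a rough illustration of the pair-mining step described above, the sketch below ranks unlabelled speech segments and unlabelled images by their similarity to the few support examples and cross-pairs the top matches as noisy training pairs. The embedding inputs, the top-k strategy, and the zipped pairing are assumptions for illustration, not the papers' exact procedure.

```python
import numpy as np

def mine_pairs(support_word_embs, support_image_embs,
               unlab_speech_embs, unlab_image_embs, k=100):
    """Mine new (speech, image) training pairs for one word class.

    support_*_embs : (s, d) embeddings of the few given examples.
    unlab_*_embs   : (N, d) / (M, d) embeddings of unlabelled pools.

    Simplified strategy: rank unlabelled speech segments by similarity to
    the support words, rank unlabelled images by similarity to the support
    images, and pair the top-k of each as noisy positives for the class.
    """
    def top_k(query_embs, pool_embs, k):
        q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
        p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
        scores = (p @ q.T).max(axis=1)   # best match to any support item
        return np.argsort(-scores)[:k]

    speech_idx = top_k(support_word_embs, unlab_speech_embs, k)
    image_idx = top_k(support_image_embs, unlab_image_embs, k)
    return list(zip(speech_idx, image_idx))  # noisy word-image training pairs
```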
arXiv Detail & Related papers (2023-06-20T08:27:42Z) - What does a platypus look like? Generating customized prompts for
zero-shot image classification [52.92839995002636]
This work introduces a simple method to generate higher accuracy prompts without relying on any explicit knowledge of the task domain.
We leverage the knowledge contained in large language models (LLMs) to generate many descriptive sentences that contain important discriminating characteristics of the image categories.
This approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet.
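A minimal sketch of this prompt-ensembling idea with a CLIP-style matching model: per-class descriptive sentences (which the paper obtains from an LLM; hand-written here) are embedded, averaged into a class prototype, and compared against the image embedding. The model checkpoint and example prompts are assumptions for illustration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor  # assumed matching model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Descriptive sentences per class, as an LLM might generate them.
class_prompts = {
    "platypus": ["A platypus has a duck-like bill and a flat tail.",
                 "A photo of a platypus, a small semi-aquatic mammal."],
    "lemur":    ["A lemur has large reflective eyes and a long ringed tail.",
                 "A photo of a lemur sitting in a tree."],
}

def class_embeddings(prompts_by_class):
    embs = {}
    for name, prompts in prompts_by_class.items():
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        with torch.no_grad():
            feats = model.get_text_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embs[name] = feats.mean(dim=0)   # average over the prompt ensemble
    return embs

def classify(image, embs):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img = model.get_image_features(**inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    scores = {name: float(img @ e) for name, e in embs.items()}
    return max(scores, key=scores.get)   # predicted class name
```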
arXiv Detail & Related papers (2022-09-07T17:27:08Z) - Image Retrieval from Contextual Descriptions [22.084939474881796]
Image Retrieval from Contextual Descriptions (ImageCoDe)
Models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates, given a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
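The retrieval protocol reduces to scoring one description against ten candidate images in a joint embedding space; a minimal sketch, with the encoders left abstract:

```python
import numpy as np

def retrieve(description_emb: np.ndarray, candidate_embs: np.ndarray) -> int:
    """Pick the candidate image best matching a contextual description.

    description_emb: (d,)    embedding of the (possibly long) description.
    candidate_embs:  (10, d) embeddings of minimally contrastive candidates.
    Any joint image-text embedding model could supply these vectors.
    """
    d = description_emb / np.linalg.norm(description_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ d))   # index of the predicted image
```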
arXiv Detail & Related papers (2022-03-29T19:18:12Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
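The paper steers text generation with the matching model at inference time; as a much-simplified illustration of combining the two components, the sketch below merely samples candidate captions from GPT-2 and re-ranks them by CLIP image-text similarity. Re-ranking is a stand-in here, not the paper's optimization procedure, and the checkpoints and decoding settings are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption(image, prompt="A photo of", n_candidates=16):
    # 1) Sample fluent candidate continuations from the language model.
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, do_sample=True, top_p=0.9, max_length=20,
                      num_return_sequences=n_candidates,
                      pad_token_id=lm_tok.eos_token_id)
    candidates = [lm_tok.decode(o, skip_special_tokens=True) for o in out]

    # 2) Re-rank the candidates by image-text similarity.
    inputs = clip_proc(text=candidates, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_image[0]   # (n_candidates,)
    return candidates[int(sims.argmax())]
```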
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method that explicitly enhances conventional word embeddings with multi-aspect senses derived from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
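A minimal sketch of the kind of fusion this suggests: aggregate the features of a word's dictionary images and combine them with the conventional word embedding. The simple average-and-concatenate fusion and the weighting factor are assumptions, not the paper's architecture.

```python
import numpy as np

def visually_guided_embedding(word_emb, image_feats, alpha=0.5):
    """Fuse a word embedding with visual features from a word-image dictionary.

    word_emb:    (d,)    conventional word embedding.
    image_feats: (m, v)  features of the m dictionary images for this word.
    """
    visual = image_feats.mean(axis=0)               # aggregate visual senses
    visual = visual / np.linalg.norm(visual)
    word = word_emb / np.linalg.norm(word_emb)
    return np.concatenate([word, alpha * visual])   # visually grounded vector
```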
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - Direct multimodal few-shot learning of speech and images [37.039034113884085]
We propose direct models that learn a shared embedding space of spoken words and images from only a few paired examples.
We show that the improvements are due to the combination of unsupervised and transfer learning in the direct models, and the absence of two-step compounding errors.
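One way to learn such a shared space directly from the few paired examples is a symmetric contrastive objective over matched speech-image pairs; a sketch under that assumption (the paper's exact objective may differ):

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(speech_embs, image_embs, temperature=0.1):
    """Pull each spoken word towards its paired image in a shared space and
    push it away from the other images in the batch (InfoNCE-style).

    speech_embs, image_embs: (b, d); row i of each is a matched pair.
    """
    s = F.normalize(speech_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    logits = s @ v.T / temperature         # (b, b) similarity matrix
    targets = torch.arange(s.size(0))      # diagonal entries are the matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```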
arXiv Detail & Related papers (2020-12-10T14:06:57Z) - COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content
Conditioned Style Encoder [70.23358875904891]
Unsupervised image-to-image translation aims to learn a mapping of an image in a given domain to an analogous image in a different domain.
We propose a new few-shot image translation model, COCO-FUNIT, which computes the style embedding of the example images conditioned on the input image.
Our model shows effectiveness in addressing the content loss problem.
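A sketch of the conditioning idea in PyTorch: the style code is computed from both the example (style) image features and the input (content) image features, so content-irrelevant appearance details in the example can be down-weighted. Layer sizes and the fusion layer are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ContentConditionedStyleEncoder(nn.Module):
    """Compute a style code that depends on both the style example and the
    content image, rather than on the style example alone."""

    def __init__(self, feat_dim=256, style_dim=64):
        super().__init__()
        self.style_net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                       nn.Linear(256, style_dim))
        self.content_net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                         nn.Linear(256, style_dim))
        self.fuse = nn.Linear(2 * style_dim, style_dim)

    def forward(self, style_feat, content_feat):
        s = self.style_net(style_feat)       # appearance of the example image
        c = self.content_net(content_feat)   # what is being translated
        return self.fuse(torch.cat([s, c], dim=-1))  # content-aware style code
```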
arXiv Detail & Related papers (2020-07-15T02:01:14Z) - Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
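A minimal sketch of a bag-of-visual-words objective of this kind: quantize a teacher's dense descriptors against a visual-word vocabulary to form a target histogram, and train the student to predict that distribution. The hard assignment and cross-entropy formulation are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def bow_targets(feature_map, vocabulary):
    """Quantize a dense feature map into a bag-of-visual-words histogram.

    feature_map: (n, d)  local descriptors from a (frozen) teacher convnet.
    vocabulary:  (K, d)  visual-word centroids (e.g. from k-means).
    Returns a length-K normalized histogram, used as the prediction target.
    """
    f = F.normalize(feature_map, dim=-1)
    v = F.normalize(vocabulary, dim=-1)
    words = torch.argmax(f @ v.T, dim=-1)   # hard-assign each descriptor
    hist = torch.bincount(words, minlength=vocabulary.size(0)).float()
    return hist / hist.sum()

def bow_prediction_loss(student_logits, target_hist):
    """Train the student to predict the teacher's BoW distribution from a
    perturbed view of the image (cross-entropy against the soft target)."""
    return -(target_hist * F.log_softmax(student_logits, dim=-1)).sum()
```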
arXiv Detail & Related papers (2020-02-27T16:45:25Z) - Discoverability in Satellite Imagery: A Good Sentence is Worth a
Thousand Pictures [0.0]
Small satellite constellations provide daily global coverage of the Earth's landmass.
Extracting text annotations from raw pixels requires two dependent machine learning models.
We evaluate seven models on the largest previous benchmark for satellite image captions.
arXiv Detail & Related papers (2020-01-03T20:41:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.