Direct multimodal few-shot learning of speech and images
- URL: http://arxiv.org/abs/2012.05680v1
- Date: Thu, 10 Dec 2020 14:06:57 GMT
- Title: Direct multimodal few-shot learning of speech and images
- Authors: Leanne Nortje, Herman Kamper
- Abstract summary: We propose direct models that learn a shared embedding space of spoken words and images from only a few paired examples.
We show that the improvements are due to the combination of unsupervised and transfer learning in the direct models, and the absence of two-step compounding errors.
- Score: 37.039034113884085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose direct multimodal few-shot models that learn a shared embedding
space of spoken words and images from only a few paired examples. Imagine an
agent is shown an image along with a spoken word describing the object in the
picture, e.g. pen, book and eraser. After observing a few paired examples of
each class, the model is asked to identify the "book" in a set of unseen
pictures. Previous work used a two-step indirect approach relying on learned
unimodal representations: speech-speech and image-image comparisons are
performed across the support set of given speech-image pairs. We propose two
direct models which instead learn a single multimodal space where inputs from
different modalities are directly comparable: a multimodal triplet network
(MTriplet) and a multimodal correspondence autoencoder (MCAE). To train these
direct models, we mine speech-image pairs: the support set is used to pair up
unlabelled in-domain speech and images. In a speech-to-image digit matching
task, direct models outperform indirect models, with the MTriplet achieving the
best multimodal five-shot accuracy. We show that the improvements are due to
the combination of unsupervised and transfer learning in the direct models, and
the absence of two-step compounding errors.
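The core of a direct model like the MTriplet is a hinge triplet objective that pulls a spoken-word embedding towards its mined matching image and pushes it away from a mismatched one, so that inputs from both modalities become directly comparable in one space. The sketch below is a minimal NumPy illustration of such a cross-modal triplet loss; the function names, the squared-Euclidean distance, and the margin value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    """Project embeddings onto the unit sphere (a common convention
    for metric-learning losses; assumed here, not taken from the paper)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_triplet_loss(speech_emb, image_pos, image_neg, margin=0.2):
    """Hinge triplet loss over a batch: each row of speech_emb is an
    anchor spoken word, image_pos its mined matching image embedding,
    and image_neg a mismatched image embedding."""
    a = l2_normalize(speech_emb)
    p = l2_normalize(image_pos)
    n = l2_normalize(image_neg)
    d_pos = np.sum((a - p) ** 2, axis=-1)  # distance to the matching image
    d_neg = np.sum((a - n) ** 2, axis=-1)  # distance to the mismatched image
    # Loss is zero once the mismatched image is at least `margin` further away.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

At test time, the same shared space supports the speech-to-image matching task directly: embed the spoken query and all candidate images, and pick the nearest image, with no intermediate unimodal comparison step to compound errors.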
Related papers
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
arXiv Detail & Related papers (2023-06-20T08:27:42Z)
- Visually grounded few-shot word acquisition with fewer shots [26.114011076658237]
We propose a model that acquires new words and their visual depictions from just a few word-image example pairs.
We use a word-to-image attention mechanism to determine word-image similarity.
With this new model, we achieve better performance with fewer shots than any existing approach.
arXiv Detail & Related papers (2023-05-25T11:05:54Z)
- Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models [61.97890177840515]
Humans use cross-modal information to learn new concepts efficiently.
We propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities.
arXiv Detail & Related papers (2023-01-16T05:40:42Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images [27.696096343873215]
We consider the task of multimodal one-shot speech-image matching.
In both unimodal and multimodal few-shot matching experiments, we find that transfer learning outperforms unsupervised training.
arXiv Detail & Related papers (2020-08-14T09:13:37Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for integrating the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.