Direct multimodal few-shot learning of speech and images
- URL: http://arxiv.org/abs/2012.05680v1
- Date: Thu, 10 Dec 2020 14:06:57 GMT
- Title: Direct multimodal few-shot learning of speech and images
- Authors: Leanne Nortje, Herman Kamper
- Abstract summary: We propose direct models that learn a shared embedding space of spoken words and images from only a few paired examples.
We show that the improvements are due to the combination of unsupervised and transfer learning in the direct models, and the absence of two-step compounding errors.
- Score: 37.039034113884085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose direct multimodal few-shot models that learn a shared embedding
space of spoken words and images from only a few paired examples. Imagine an
agent is shown an image along with a spoken word describing the object in the
picture, e.g. pen, book and eraser. After observing a few paired examples of
each class, the model is asked to identify the "book" in a set of unseen
pictures. Previous work used a two-step indirect approach relying on learned
unimodal representations: speech-speech and image-image comparisons are
performed across the support set of given speech-image pairs. We propose two
direct models which instead learn a single multimodal space where inputs from
different modalities are directly comparable: a multimodal triplet network
(MTriplet) and a multimodal correspondence autoencoder (MCAE). To train these
direct models, we mine speech-image pairs: the support set is used to pair up
unlabelled in-domain speech and images. In a speech-to-image digit matching
task, direct models outperform indirect models, with the MTriplet achieving the
best multimodal five-shot accuracy. We show that the improvements are due to
the combination of unsupervised and transfer learning in the direct models, and
the absence of two-step compounding errors.
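To make the direct-model idea concrete, below is a minimal sketch (in PyTorch) of how such a system could look: a speech encoder and an image encoder map both modalities into one shared space, a triplet loss over mined speech-image pairs pulls matching pairs together across modalities, and a spoken query is then matched to unseen images by nearest neighbour. The encoder architectures, dimensions, and helper names (SpeechEncoder, ImageEncoder, mtriplet_step, speech_to_image_match) are illustrative assumptions, not the authors' exact MTriplet or MCAE models.

```python
# Hedged sketch of a direct multimodal few-shot matcher: both modalities are
# embedded into a single shared space and compared directly, as in the MTriplet
# idea described above. Architectures and sizes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Embeds a log-mel spectrogram (batch, frames, mels) into the shared space."""
    def __init__(self, n_mels=40, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, x):
        _, h = self.rnn(x)               # final hidden state summarises the spoken word
        return F.normalize(h[-1], dim=-1)

class ImageEncoder(nn.Module):
    """Embeds a flattened image (batch, pixels) into the shared space."""
    def __init__(self, n_pixels=28 * 28, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def mtriplet_step(speech_enc, image_enc, speech, image_pos, image_neg, margin=0.2):
    """One triplet-style update: the mined matching image is pulled towards the
    spoken word while a mismatched image is pushed away, directly across modalities."""
    a = speech_enc(speech)
    p = image_enc(image_pos)
    n = image_enc(image_neg)
    return F.triplet_margin_loss(a, p, n, margin=margin)

def speech_to_image_match(speech_enc, image_enc, query_speech, candidate_images):
    """Few-shot test step: return the index of the unseen image closest to the query."""
    q = speech_enc(query_speech.unsqueeze(0))      # (1, dim)
    c = image_enc(candidate_images)                # (n_candidates, dim)
    return torch.argmax(c @ q.squeeze(0)).item()   # cosine similarity on unit vectors
```

In the paper's five-shot setting, the given support pairs would first be used to mine additional unlabelled in-domain speech-image pairs, and a loss along these lines would then be trained on the mined pairs before the nearest-neighbour matching step is applied to unseen images.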
Related papers
- An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation [43.139415423751615]
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment.
A pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task.
We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model.
arXiv Detail & Related papers (2024-08-16T10:33:19Z) - Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend capabilities by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z) - Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
arXiv Detail & Related papers (2023-06-20T08:27:42Z) - Visually grounded few-shot word acquisition with fewer shots [26.114011076658237]
We propose a model that acquires new words and their visual depictions from just a few word-image example pairs.
We use a word-to-image attention mechanism to determine word-image similarity.
With this new model, we achieve better performance with fewer shots than any existing approach.
arXiv Detail & Related papers (2023-05-25T11:05:54Z) - Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models [69.31424345583537]
Humans use cross-modal information to learn new concepts efficiently.
We show that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark.
We construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
arXiv Detail & Related papers (2023-01-16T05:40:42Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images [27.696096343873215]
We consider the task of multimodal one-shot speech-image matching.
In both unimodal and multimodal few-shot matching experiments, we find that transfer learning outperforms unsupervised training.
arXiv Detail & Related papers (2020-08-14T09:13:37Z) - Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response in the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)