Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with
Multimodal Models
- URL: http://arxiv.org/abs/2301.06267v4
- Date: Thu, 3 Aug 2023 01:56:35 GMT
- Title: Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with
Multimodal Models
- Authors: Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan
- Abstract summary: Humans use cross-modal information to learn new concepts efficiently.
We propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities.
- Score: 61.97890177840515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to quickly learn a new task with minimal instruction - known as
few-shot learning - is a central aspect of intelligent agents. Classical
few-shot benchmarks make use of few-shot samples from a single modality, but
such samples may not be sufficient to characterize an entire concept class. In
contrast, humans use cross-modal information to learn new concepts efficiently.
In this work, we demonstrate that one can indeed build a better ${\bf visual}$
dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them
bark. To do so, we exploit the fact that recent multimodal foundation models
such as CLIP are inherently cross-modal, mapping different modalities to the
same representation space. Specifically, we propose a simple cross-modal
adaptation approach that learns from few-shot examples spanning different
modalities. By repurposing class names as additional one-shot training samples,
we achieve SOTA results with an embarrassingly simple linear classifier for
vision-language adaptation. Furthermore, we show that our approach can benefit
existing methods such as prefix tuning, adapters, and classifier ensembling.
Finally, to explore other modalities beyond vision and language, we construct
the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal
training to improve the performance of both image and audio classification.
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Meta Learning to Bridge Vision and Language Models for Multimodal
Few-Shot Learning [38.37682598345653]
We introduce a multimodal meta-learning approach to bridge the gap between vision and language models.
We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models.
We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words.
arXiv Detail & Related papers (2023-02-28T17:46:18Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - Few-Shot Learning with a Strong Teacher [36.35502703114652]
Few-shot learning aims to train a strong classifier using limited labeled examples.
Many existing works take the meta-learning approach, sampling few-shot tasks in turn and optimizing the few-shot learner's performance on classifying the query examples.
We propose a novel objective to directly train the few-shot learner to perform like a strong classifier.
arXiv Detail & Related papers (2021-07-01T03:20:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.