Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
- URL: http://arxiv.org/abs/2301.06267v5
- Date: Tue, 27 Aug 2024 19:00:47 GMT
- Title: Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
- Authors: Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan
- Abstract summary: Humans use cross-modal information to learn new concepts efficiently.
We show that one can indeed build a better ${\bf visual}$ dog classifier by reading about dogs and listening to them bark.
We construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
- Score: 69.31424345583537
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for ${\bf cross-modal}$ ${\bf adaptation}$: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as an additional training sample, we trivially turn any n-shot learning problem into a (n+1)-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
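The core recipe described in the abstract lends itself to a short sketch. Below is a minimal illustration (not the authors' released code) of cross-modal adaptation, assuming CLIP image and text features have already been extracted; the function names and the use of scikit-learn's LogisticRegression as the linear classifier are illustrative choices, not the paper's exact setup.

```python
# Minimal sketch of cross-modal adaptation: treat the class-name text
# embeddings as extra few-shot training examples alongside the image
# embeddings, then fit a simple linear classifier on the shared space.
# Assumes precomputed CLIP features (hypothetical inputs, not real data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def l2_normalize(x):
    # CLIP features are compared on a shared unit hypersphere.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_modal_linear_probe(image_feats, image_labels, text_feats, text_labels):
    """image_feats: (n_shots * n_classes, d) CLIP image embeddings.
    text_feats: (n_classes, d) CLIP text embeddings of the class names
    (e.g. encoded from a prompt such as "a photo of a {class}").
    Repurposing each class name as one extra training sample turns an
    n-shot problem into an (n+1)-shot problem."""
    X = l2_normalize(np.concatenate([image_feats, text_feats], axis=0))
    y = np.concatenate([image_labels, text_labels], axis=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

# Hypothetical usage with precomputed features:
# clf = cross_modal_linear_probe(img_x, img_y, txt_x, np.arange(n_classes))
# preds = clf.predict(l2_normalize(test_img_feats))
```

The design point is simply that, because CLIP's encoders map images and text into the same representation space, examples from either modality can be pooled into one training set for an ordinary linear probe.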
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Few-Shot Learning with a Strong Teacher [36.35502703114652]
Few-shot learning aims to train a strong classifier using limited labeled examples.
Many existing works take the meta-learning approach, sampling few-shot tasks in turn and optimizing the few-shot learner's performance on classifying the query examples.
We propose a novel objective to directly train the few-shot learner to perform like a strong classifier.
arXiv Detail & Related papers (2021-07-01T03:20:46Z)
- Mutual Modality Learning for Video Action Classification [74.83718206963579]
We show how to embed multi-modality into a single model for video action classification.
We achieve state-of-the-art results in the Something-Something-v2 benchmark.
arXiv Detail & Related papers (2020-11-04T21:20:08Z)
- 'Less Than One'-Shot Learning: Learning N Classes From M<N Samples [13.70633147306388]
In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class.
We propose the 'less than one'-shot learning task, where models must learn $N$ new classes given only $M < N$ examples.
arXiv Detail & Related papers (2020-09-17T17:55:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.