Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
Cross-Modal Prompting
- URL: http://arxiv.org/abs/2204.07841v3
- Date: Mon, 27 Mar 2023 15:40:57 GMT
- Title: Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
Cross-Modal Prompting
- Authors: Guangxing Han, Long Chen, Jiawei Ma, Shiyuan Huang, Rama Chellappa,
Shih-Fu Chang
- Abstract summary: We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
- Score: 77.69172089359606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study multi-modal few-shot object detection (FSOD) in this paper, using
both few-shot visual examples and class semantic information for detection,
which are complementary to each other by definition. Most previous works on
multi-modal FSOD are fine-tuning-based, which is inefficient for online
applications. Moreover, these methods usually require prior knowledge such as
class names to extract class semantic embeddings, which is hard to obtain for
rare classes. Our approach is motivated by the high-level conceptual similarity
between (metric-based) meta-learning and prompt-based learning, which learn
generalizable few-shot and zero-shot object detection models, respectively,
without fine-tuning. Specifically, we combine the few-shot visual classifier
and the text classifier, learned via meta-learning and prompt-based learning
respectively, to build the multi-modal classifier and detection models. In
addition, to fully
exploit the pre-trained language models, we propose meta-learning-based
cross-modal prompting to generate soft prompts for novel classes present in
few-shot visual examples, which are then used to learn the text classifier.
Knowledge distillation is introduced to learn the soft prompt generator without
using human prior knowledge of class names, which may not be available for rare
classes. Our insight is that the few-shot support images naturally include
related context information and semantics of the class. We comprehensively
evaluate the proposed multi-modal FSOD models on multiple few-shot object
detection benchmarks, achieving promising results.
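To make the described pipeline more concrete, the sketch below illustrates, under loose assumptions, how a metric-based visual classifier and a prompt-based text classifier could be fused into a multi-modal classifier, how soft prompts might be generated from few-shot support features, and how a distillation loss could align them with class-name-based teacher embeddings. This is not the authors' implementation: the module names (CrossModalPromptGenerator, ToyTextEncoder), the GRU stand-in for the frozen pre-trained language model, the mean-pooled prototypes, and the weighted-average score fusion are all illustrative choices.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalPromptGenerator(nn.Module):
    """Hypothetical module: maps pooled few-shot support features of one class
    to a sequence of soft prompt token embeddings for the text encoder."""

    def __init__(self, vis_dim: int, txt_dim: int, num_prompt_tokens: int = 4):
        super().__init__()
        self.num_prompt_tokens = num_prompt_tokens
        self.txt_dim = txt_dim
        self.proj = nn.Linear(vis_dim, num_prompt_tokens * txt_dim)

    def forward(self, support_feats: torch.Tensor) -> torch.Tensor:
        # support_feats: (num_shots, vis_dim) for one class
        pooled = support_feats.mean(dim=0)            # class-level summary of the support set
        prompts = self.proj(pooled)                   # (num_prompt_tokens * txt_dim,)
        return prompts.view(self.num_prompt_tokens, self.txt_dim)


class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen pre-trained language model: encodes a sequence of
    (soft or word) token embeddings into a single normalized class embedding."""

    def __init__(self, txt_dim: int):
        super().__init__()
        self.encoder = nn.GRU(txt_dim, txt_dim, batch_first=True)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (num_tokens, txt_dim)
        _, h = self.encoder(token_embeds.unsqueeze(0))        # h: (1, 1, txt_dim)
        return F.normalize(h.squeeze(0).squeeze(0), dim=-1)   # (txt_dim,)


def multimodal_scores(query_feats, support_feats_per_class,
                      prompt_gen, text_encoder, alpha=0.5):
    """Fuse a metric-based visual classifier with a prompt-based text classifier
    (illustrative fusion by weighted averaging of cosine-similarity scores)."""
    vis_protos, txt_embeds = [], []
    for support_feats in support_feats_per_class:             # one tensor per novel class
        vis_protos.append(F.normalize(support_feats.mean(dim=0), dim=-1))
        soft_prompt = prompt_gen(support_feats)               # (num_prompt_tokens, txt_dim)
        txt_embeds.append(text_encoder(soft_prompt))
    vis_protos = torch.stack(vis_protos)                      # (num_classes, dim)
    txt_embeds = torch.stack(txt_embeds)                      # (num_classes, dim)

    q = F.normalize(query_feats, dim=-1)                      # (num_queries, dim)
    vis_scores = q @ vis_protos.t()                           # metric-based visual scores
    txt_scores = q @ txt_embeds.t()                           # prompt-based text scores
    return alpha * vis_scores + (1.0 - alpha) * txt_scores    # multi-modal classifier


def prompt_distillation_loss(student_txt_embed, teacher_txt_embed):
    """KD objective (sketch): pull the soft-prompt-based class embedding toward
    a teacher embedding built from class-name prompts on base classes."""
    return 1.0 - F.cosine_similarity(student_txt_embed, teacher_txt_embed, dim=-1)


if __name__ == "__main__":
    vis_dim = txt_dim = 64
    prompt_gen = CrossModalPromptGenerator(vis_dim, txt_dim)
    text_encoder = ToyTextEncoder(txt_dim)

    # Two novel classes with 5 support examples each; 10 query proposal features.
    supports = [torch.randn(5, vis_dim) for _ in range(2)]
    queries = torch.randn(10, vis_dim)

    scores = multimodal_scores(queries, supports, prompt_gen, text_encoder)
    print(scores.shape)  # torch.Size([10, 2])

    # KD sketch on a base class: the teacher embedding would come from prompting
    # the language model with the class name (random here, for illustration only).
    teacher = F.normalize(torch.randn(txt_dim), dim=-1)
    student = text_encoder(prompt_gen(supports[0]))
    print(prompt_distillation_loss(student, teacher))
```
In the paper's setting, such a distillation loss would only be applied on base classes during meta-training, where class names are available; at test time the soft prompt generator produces class embeddings for novel classes directly from the few-shot support images.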
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z)
- On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization [35.39571632348391]
Few-shot learning aims to learn representations that can tackle novel tasks.
Recent studies show that cross-modal learning can improve representations for few-shot classification.
Language is a rich modality that can be used to guide visual learning.
arXiv Detail & Related papers (2024-05-29T04:29:12Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models? [14.582209994281374]
Few-shot learning aims to train models that can be generalized to novel classes with only a few samples.
We propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning.
arXiv Detail & Related papers (2023-07-09T08:07:43Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways to specify novel categories: via language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning [38.37682598345653]
We introduce a multimodal meta-learning approach to bridge the gap between vision and language models.
We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models.
We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words.
arXiv Detail & Related papers (2023-02-28T17:46:18Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)