Revisiting Few-Shot Object Detection with Vision-Language Models
- URL: http://arxiv.org/abs/2312.14494v3
- Date: Fri, 14 Jun 2024 14:09:29 GMT
- Title: Revisiting Few-Shot Object Detection with Vision-Language Models
- Authors: Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan
- Abstract summary: We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
- Score: 49.79495118650838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The era of vision-language models (VLMs) trained on large web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundational models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.9 mAP!
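As a rough illustration of the proposed protocol, the sketch below fine-tunes a pre-trained detector on at most K multi-modal shots per class and then reports mAP on a held-out split. The detector interface (finetune, evaluate_map) and the MultiModalShot fields are hypothetical placeholders, not the paper's implementation; the actual benchmark fine-tunes open-source VLMs on nuImages.

```python
# Minimal sketch of the Foundational FSOD protocol (assumed API, not the
# authors' code). Pre-training on any external data is allowed; only the
# K-shot multi-modal support set may come from the target domain.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MultiModalShot:
    """One support example: an annotated instance plus an optional text cue."""
    image_path: str
    box: List[float]                # [x1, y1, x2, y2] of the target instance
    class_name: str
    text_cue: Optional[str] = None  # e.g. "a pickup truck, not a semi-trailer"

def foundational_fsod_eval(detector, support_shots, test_set, k=10):
    """Fine-tune a pre-trained VLM detector on <= K shots per class,
    then evaluate mAP on the held-out target-domain test split."""
    per_class = {}
    for shot in support_shots:
        per_class.setdefault(shot.class_name, []).append(shot)
    assert all(len(v) <= k for v in per_class.values()), "at most K shots per class"

    detector.finetune(support_shots)        # hypothetical API
    return detector.evaluate_map(test_set)  # hypothetical API
```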
Related papers
- Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models [13.972809192907931]
Foundation models (FMs) are large neural networks trained on broad datasets.
Human activity recognition in video has advanced with FMs, driven by competition among different architectures.
This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition.
arXiv Detail & Related papers (2024-07-22T12:59:57Z)
- Multimodal CLIP Inference for Meta-Few-Shot Image Classification [0.0]
Multimodal foundation models like CLIP learn a joint (image, text) embedding.
This study demonstrates that combining modalities from CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks (a minimal fusion sketch appears after this list).
arXiv Detail & Related papers (2024-03-26T17:47:54Z)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z)
- One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
- Robust Fine-Tuning of Vision-Language Models for Domain Generalization [6.7181844004432385]
Foundation models have impressive zero-shot inference capabilities and robustness under distribution shifts.
We present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP.
Our experimentation demonstrates that, while zero-shot CLIP fails to match performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts.
arXiv Detail & Related papers (2023-11-03T20:50:40Z)
- Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples [61.66967790884943]
Referring video object segmentation (RVOS) relies on sufficient data for a given scene.
In more realistic scenarios, only minimal annotations are available for a new scene.
We propose a model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture.
The CMA module builds multimodal affinity from only a few samples, quickly learning new semantic information and enabling the model to adapt to different scenarios (an illustrative cross-attention sketch appears after this list).
arXiv Detail & Related papers (2023-09-05T08:34:23Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Few-shot Weakly-Supervised Object Detection via Directional Statistics [55.97230224399744]
We propose a probabilistic multiple instance learning approach for few-shot Common Object Localization (COL) and few-shot Weakly Supervised Object Detection (WSOD).
Our model simultaneously learns the distribution of the novel objects and localizes them via expectation-maximization steps.
Our experiments show that the proposed method, despite being simple, outperforms strong baselines in few-shot COL and WSOD, as well as large-scale WSOD tasks.
arXiv Detail & Related papers (2021-03-25T22:34:16Z)
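The "Multimodal CLIP Inference" entry above reports gains from fusing CLIP's text and image embeddings for meta-few-shot classification. A minimal sketch of one such fusion, using OpenAI's clip package, follows; the equal-weight blend (alpha=0.5) and the prompt template are illustrative assumptions, not necessarily the paper's method.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def class_prototype(support_images, class_name, alpha=0.5):
    """Average K support-image embeddings and blend with the text embedding.

    alpha and the prompt template below are illustrative choices."""
    imgs = torch.stack([preprocess(im) for im in support_images]).to(device)
    img_feats = model.encode_image(imgs).float()
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    img_proto = img_feats.mean(dim=0)

    tokens = clip.tokenize([f"a photo of a {class_name}"]).to(device)
    txt_feat = model.encode_text(tokens).float().squeeze(0)
    txt_feat = txt_feat / txt_feat.norm()

    proto = alpha * img_proto + (1 - alpha) * txt_feat
    return proto / proto.norm()

@torch.no_grad()
def classify(query_image, prototypes):
    """Assign the query image to the nearest (cosine) class prototype."""
    q = model.encode_image(preprocess(query_image).unsqueeze(0).to(device)).float()
    q = (q / q.norm(dim=-1, keepdim=True)).squeeze(0)
    sims = {name: float(q @ p) for name, p in prototypes.items()}
    return max(sims, key=sims.get)
```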
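Likewise, the cross-modal affinity (CMA) entry describes a Transformer-based module that relates visual and text features from a few samples. A generic cross-attention block in that spirit might look like the following; the feature dimension and head count are assumed, not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalAffinity(nn.Module):
    """Illustrative cross-attention block: visual tokens attend to text tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim); text_tokens: (B, Nt, dim)
        fused, _ = self.attn(visual_tokens, text_tokens, text_tokens)
        return self.norm(visual_tokens + fused)  # residual + layer norm

# Usage with random tensors standing in for backbone and text-encoder features:
# cma = CrossModalAffinity()
# out = cma(torch.randn(2, 100, 256), torch.randn(2, 16, 256))  # -> (2, 100, 256)
```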