Related papers: Few-shot target-driven instance detection based on open-vocabulary object detection models

URL: http://arxiv.org/abs/2410.16028v1
Date: Mon, 21 Oct 2024 14:03:15 GMT
Title: Few-shot target-driven instance detection based on open-vocabulary object detection models
Authors: Ben Crulis, Barthelemy Serres, Cyril De Runz, Gilles Venturini
Abstract summary: Open-vocabulary object detection models bring visual and textual concepts closer together in the same latent space. We propose a lightweight method to turn such models into one-shot or few-shot object recognition models without requiring textual descriptions.
Score: 1.0749601922718608
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current large open vision models could be useful for one-shot and few-shot object recognition. Nevertheless, gradient-based re-training solutions are costly. On the other hand, open-vocabulary object detection models bring visual and textual concepts closer together in the same latent space, allowing zero-shot detection via prompting at small computational cost. We propose a lightweight method to turn such models into one-shot or few-shot object recognition models without requiring textual descriptions. Our experiments on the TEgO dataset using the YOLO-World model as a base show that performance increases with the model size, the number of examples and the use of image augmentation.

Related papers

Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting [1.1871535995163365]
Textual Inversion (TI) allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluate whether the method matches or outperforms the baseline methods that suffer from forgetting in a variety of quantitative and qualitative experiments.
arXiv Detail & Related papers (2025-08-07T12:28:08Z)
Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models. We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
Automatic Discovery of Visual Circuits [66.99553804855931]
We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept. We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
arXiv Detail & Related papers (2024-04-22T17:00:57Z)
Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture. A large-scale pre-trained model is used to build class-reference embeddings or prototypes. We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z)
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking. Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z)
One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category. We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples. We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model. Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z)
Few-shot Weakly-Supervised Object Detection via Directional Statistics [55.97230224399744]
We propose a probabilistic multiple instance learning approach for few-shot Common Object Localization (COL) and few-shot Weakly Supervised Object Detection (WSOD). Our model simultaneously learns the distribution of the novel objects and localizes them via expectation-maximization steps. Our experiments show that the proposed method, despite being simple, outperforms strong baselines in few-shot COL and WSOD, as well as large-scale WSOD tasks.
arXiv Detail & Related papers (2021-03-25T22:34:16Z)
Few-shot Object Detection on Remote Sensing Images [11.40135025181393]
We introduce a few-shot learning-based method for object detection on remote sensing images. We build our few-shot object detection model upon the YOLOv3 architecture and develop a multi-scale object detection framework.
arXiv Detail & Related papers (2020-06-14T07:18:10Z)
One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module. We also propose novel training strategies that effectively improve detection performance. Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.