Related papers: Sampling Bag of Views for Open-Vocabulary Object Detection

Sampling Bag of Views for Open-Vocabulary Object Detection

URL: http://arxiv.org/abs/2412.18273v1
Date: Tue, 24 Dec 2024 08:32:38 GMT
Title: Sampling Bag of Views for Open-Vocabulary Object Detection
Authors: Hojun Choi, Junsuk Choe, Hyunjung Shim,
Abstract summary: We propose a concept-based alignment method that samples a more powerful and efficient compositional structure.<n>Our method achieves improvements of 2.6 box AP50 and 0.5 mask AP over prior work on novel categories in the open-vocabulary COCO and LVIS benchmarks.
Score: 22.001826330679233
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing open-vocabulary object detection (OVD) develops methods for testing unseen categories by aligning object region embeddings with corresponding VLM features. A recent study leverages the idea that VLMs implicitly learn compositional structures of semantic concepts within the image. Instead of using an individual region embedding, it utilizes a bag of region embeddings as a new representation to incorporate compositional structures into the OVD task. However, this approach often fails to capture the contextual concepts of each region, leading to noisy compositional structures. This results in only marginal performance improvements and reduced efficiency. To address this, we propose a novel concept-based alignment method that samples a more powerful and efficient compositional structure. Our approach groups contextually related ``concepts'' into a bag and adjusts the scale of concepts within the bag for more effective embedding alignment. Combined with Faster R-CNN, our method achieves improvements of 2.6 box AP50 and 0.5 mask AP over prior work on novel categories in the open-vocabulary COCO and LVIS benchmarks. Furthermore, our method reduces CLIP computation in FLOPs by 80.3% compared to previous research, significantly enhancing efficiency. Experimental results demonstrate that the proposed method outperforms previous state-of-the-art models on the OVD datasets.

Related papers

FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion [18.996022873991596]
We present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models.<n> FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories.<n>In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network.
arXiv Detail & Related papers (2026-02-03T05:45:22Z)
Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation [29.809079908218607]
This work introduces a fresh solution to reinforce base pseudo-labels and facilitate target-prompt learning.<n>We first propose to leverage the reference predictions based on the relationship between source and target visual embeddings.<n>We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models.
arXiv Detail & Related papers (2025-06-13T06:33:27Z)
Topology-Aware CLIP Few-Shot Learning [0.0]
We introduce a topology-aware tuning approach integrating Representation Topology Divergence into the Task Residual framework.<n>By explicitly aligning the topological structures of visual and text representations using a combined RTD and Cross-Entropy loss, our method enhances few-shot performance.
arXiv Detail & Related papers (2025-05-03T04:58:29Z)
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection [11.497620257835964]
We propose CCKT-Det trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer from language queries and visual region features extracted from vision-language models (VLMs) CCKT-Det can consistently improve performance as the scale of VLMs increases, all while requiring the detector at a moderate level of overhead.
arXiv Detail & Related papers (2025-03-14T02:04:28Z)
Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification [7.005068872406135]
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks. We present a novel approach for exploiting the multilayered nature of pretrained models for ASV. We show how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
arXiv Detail & Related papers (2024-09-12T05:55:32Z)
SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection [2.0755366440393743]
Confusion and forgetting of object classes have been challenges of prime interest in Few-Shot Object Detection (FSOD) We introduce a novel Submodular Mutual Information Learning framework which adopts mutual information functions. Our proposed approach generalizes to several existing approaches in FSOD, agnostic of the backbone architecture.
arXiv Detail & Related papers (2024-07-02T20:53:43Z)
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
ConvBLS: An Effective and Efficient Incremental Convolutional Broad Learning System for Image Classification [63.49762079000726]
We propose a convolutional broad learning system (ConvBLS) based on the spherical K-means (SKM) algorithm and two-stage multi-scale (TSMS) feature fusion. Our proposed ConvBLS method is unprecedentedly efficient and effective.
arXiv Detail & Related papers (2023-04-01T04:16:12Z)
Aligning Bag of Regions for Open-Vocabulary Object Detection [74.89762864838042]
We propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. Our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks.
arXiv Detail & Related papers (2023-02-27T17:39:21Z)
CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation [67.12857074801731]
We introduce a novel method, CPPF++, designed for sim-to-real pose estimation. To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty. We incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a feature ensemble.
arXiv Detail & Related papers (2022-11-24T03:27:00Z)
Deep Active Ensemble Sampling For Image Classification [8.31483061185317]
Active learning frameworks aim to reduce the cost of data annotation by actively requesting the labeling for the most informative data points. Some proposed approaches include uncertainty-based techniques, geometric methods, implicit combination of uncertainty-based and geometric approaches. We present an innovative integration of recent progress in both uncertainty-based and geometric frameworks to enable an efficient exploration/exploitation trade-off in sample selection strategy. Our framework provides two advantages: (1) accurate posterior estimation, and (2) tune-able trade-off between computational overhead and higher accuracy.
arXiv Detail & Related papers (2022-10-11T20:20:20Z)
Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization [73.14053674836838]
Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. We present MEDet, a novel OVD framework with proposal mining and prediction equalization.
arXiv Detail & Related papers (2022-06-22T14:30:41Z)
Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [54.94682858474711]
Class Activation Mapping (CAM) approaches provide an effective visualization by taking weighted averages of the activation maps. We propose a novel set of metrics to quantify explanation maps, which show better effectiveness and simplify comparisons between approaches.
arXiv Detail & Related papers (2021-04-20T21:34:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.