Aligning Bag of Regions for Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2302.13996v1
- Date: Mon, 27 Feb 2023 17:39:21 GMT
- Title: Aligning Bag of Regions for Open-Vocabulary Object Detection
- Authors: Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, Chen Change Loy
- Abstract summary: We propose to align the embeddings of bags of regions rather than only individual regions.
The proposed method groups contextually interrelated regions into a bag.
Our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on the novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively.
- Score: 74.89762864838042
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually
contains a bag of semantic concepts. However, existing open-vocabulary object
detectors only align region embeddings individually with the corresponding
features extracted from the VLMs. Such a design leaves the compositional
structure of semantic concepts in a scene under-exploited, although the
structure may be implicitly learned by the VLMs. In this work, we propose to
align the embeddings of bags of regions beyond individual regions. The proposed
method groups contextually interrelated regions as a bag. The embeddings of
regions in a bag are treated as embeddings of words in a sentence, and they are
sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which
is learned to be aligned to the corresponding features extracted by a frozen
VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the
previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of
open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are
available at https://github.com/wusize/ovdet.
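For intuition, here is a minimal sketch of the bag-of-regions alignment idea described in the abstract. It is not the authors' implementation (see the linked repository for that): the module names, feature dimensions, the linear pseudo-word projection, and the mean-pooling stand-in for the frozen text encoder are all illustrative assumptions. Only the overall recipe follows the abstract: project grouped region features into pseudo-words, encode them with a frozen VLM text encoder, and contrastively align the result with the frozen image-encoder feature of the crop enclosing the bag.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BagOfRegionsAlignment(nn.Module):
    """Sketch of aligning a bag of region embeddings with a frozen VLM feature."""

    def __init__(self, region_dim=1024, embed_dim=512, temperature=0.07):
        super().__init__()
        # Hypothetical projection from detector region features to the VLM's
        # word-embedding space ("pseudo words"); dimensions are assumptions.
        self.to_pseudo_words = nn.Linear(region_dim, embed_dim)
        self.temperature = temperature

    def forward(self, region_feats, frozen_text_encoder, frozen_image_feats):
        # region_feats:        (B, R, region_dim) features of R grouped regions per bag
        # frozen_text_encoder: callable mapping (B, R, embed_dim) pseudo-word
        #                      sequences to (B, embed_dim) "sentence" embeddings
        # frozen_image_feats:  (B, embed_dim) frozen VLM image features of the
        #                      crop enclosing each bag (the alignment target)
        pseudo_words = self.to_pseudo_words(region_feats)      # (B, R, D)
        bag_embed = F.normalize(frozen_text_encoder(pseudo_words), dim=-1)
        teacher = F.normalize(frozen_image_feats, dim=-1)
        # InfoNCE-style loss: each bag embedding should match its own crop feature.
        logits = bag_embed @ teacher.t() / self.temperature    # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy run with random tensors; the mean-pool lambda stands in for a real
    # frozen CLIP text encoder fed with pseudo-word sequences.
    loss_fn = BagOfRegionsAlignment()
    loss = loss_fn(torch.randn(4, 5, 1024),
                   lambda words: words.mean(dim=1),
                   torch.randn(4, 512))
    print(float(loss))
```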
Related papers
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [78.0010542552784]
CoDet is a novel approach to learn object-level vision-language representations for open-vocabulary object detection.
By grouping images that mention a shared concept in their captions, objects corresponding to the shared concept should exhibit high co-occurrence across the group (a toy sketch of this grouping step follows this entry).
CoDet achieves superior performance and compelling scalability in open-vocabulary detection.
arXiv Detail & Related papers (2023-10-25T14:31:02Z)
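For illustration only, the following toy sketch (an assumption of mine, not CoDet's code) mimics the grouping step summarized above: images whose captions mention the same concept word are collected into a group, within which regions of that concept should co-occur and can then be aligned with the shared word.

```python
from collections import defaultdict


def group_by_shared_concept(captions, vocabulary):
    """Map each concept word to the indices of images whose captions mention it."""
    groups = defaultdict(list)
    for idx, caption in enumerate(captions):
        tokens = set(caption.lower().split())
        for concept in vocabulary:
            if concept in tokens:
                groups[concept].append(idx)
    # Keep only concepts shared by at least two images, so co-occurrence is informative.
    return {c: ids for c, ids in groups.items() if len(ids) >= 2}


print(group_by_shared_concept(
    ["a dog chasing a ball", "a sleeping dog", "a red car"],
    vocabulary={"dog", "ball", "car"},
))  # -> {'dog': [0, 1]}
```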
- OV-VG: A Benchmark for Open-Vocabulary Visual Grounding [33.02137080950678]
This work introduces the novel and challenging tasks of open-vocabulary visual grounding (OV-VG) and open-vocabulary phrase localization (OV-PL).
The overarching aim is to establish connections between language descriptions and the localization of novel objects.
We have curated a benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images.
arXiv Detail & Related papers (2023-10-22T17:54:53Z)
- Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA [15.74007067413724]
We propose a novel framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images to answer questions.
arXiv Detail & Related papers (2023-04-04T07:46:40Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets, respectively.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose RegionLearner, a module for video-text learning that takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.