CLIM: Contrastive Language-Image Mosaic for Region Representation
- URL: http://arxiv.org/abs/2312.11376v2
- Date: Tue, 19 Dec 2023 05:08:45 GMT
- Title: CLIM: Contrastive Language-Image Mosaic for Region Representation
- Authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Wentao Liu, Chen Change Loy
- Abstract summary: Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
- Score: 58.05870131126816
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Detecting objects accurately from a large or open vocabulary necessitates vision-language alignment on region representations. However, learning such region-text alignment by obtaining high-quality box annotations with text labels or descriptions is expensive and infeasible at scale. In contrast, collecting image-text pairs is simpler but lacks precise object location information to associate regions with texts. In this paper, we propose a novel approach called Contrastive Language-Image Mosaic (CLIM), which effectively leverages large-scale image-text pairs to align region and text representations. CLIM combines multiple images into a mosaicked image and treats each image as a 'pseudo region'. The feature of each pseudo region is extracted and trained to be similar to the corresponding text embedding and dissimilar from the others via a contrastive loss, enabling the model to learn region-text alignment without costly box annotations. As a generally applicable approach, CLIM consistently improves different open-vocabulary object detection methods that use caption supervision. Furthermore, CLIM can effectively enhance the region representation of vision-language models, thus providing stronger backbones for open-vocabulary object detectors. Our experimental results demonstrate that CLIM improves different baseline open-vocabulary object detectors by a large margin on both the OV-COCO and OV-LVIS benchmarks. The code is available at https://github.com/wusize/CLIM.
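The mosaic-as-pseudo-region idea in the abstract is simple enough to sketch. Below is a minimal, hypothetical PyTorch rendition of one training step, not the authors' released implementation (see the repository linked above for that): four image-caption pairs are tiled into a 2x2 mosaic, each tile's feature is RoI-pooled from the encoder's feature map as a pseudo region, and a symmetric InfoNCE-style contrastive loss pulls each region toward its own caption embedding and away from the other three. The `image_encoder` interface, its stride, and the 2x2 layout are illustrative assumptions.

```python
# Minimal sketch of a CLIM-style mosaic + contrastive step (illustrative;
# not the authors' code -- see https://github.com/wusize/CLIM for that).
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def mosaic_2x2(images):
    """Tile four (3, H, W) images into one (3, 2H, 2W) mosaicked image."""
    top = torch.cat([images[0], images[1]], dim=2)     # concat along width
    bottom = torch.cat([images[2], images[3]], dim=2)
    return torch.cat([top, bottom], dim=1)             # concat along height

def clim_step(images, text_embs, image_encoder, stride=32, tau=0.07):
    """One contrastive step treating each mosaic tile as a 'pseudo region'.

    images:        (4, 3, H, W) source images
    text_embs:     (4, D) L2-normalized caption embeddings; D is assumed to
                   equal the encoder's channel dim (CLIP-like assumption)
    image_encoder: maps (1, 3, 2H, 2W) -> (1, D, 2H/stride, 2W/stride)
    """
    _, _, H, W = images.shape
    mosaic = mosaic_2x2(images).unsqueeze(0)
    feat_map = image_encoder(mosaic)

    # Each tile's box in mosaic coordinates: (batch_idx, x1, y1, x2, y2).
    boxes = torch.tensor([[0., 0, 0, W, H],
                          [0., W, 0, 2 * W, H],
                          [0., 0, H, W, 2 * H],
                          [0., W, H, 2 * W, 2 * H]])
    region = roi_align(feat_map, boxes, output_size=1,
                       spatial_scale=1.0 / stride).flatten(1)   # (4, D)
    region = F.normalize(region, dim=-1)

    # Symmetric InfoNCE: tile i should match caption i, repel the rest.
    logits = region @ text_embs.t() / tau                       # (4, 4)
    targets = torch.arange(4, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because every tile occupies a known box in the mosaic, positive and negative region-text pairs come for free without any box annotation, which is the point of the mosaicking trick.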
Related papers
- RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection [20.630629383286262]
Open-vocabulary object detection requires solid modeling of the region-semantic relationship.
We propose RTGen to generate scalable open-vocabulary region-text pairs.
arXiv Detail & Related papers (2024-05-30T09:03:23Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet consistently outperforms the state of the art by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
- CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [78.0010542552784]
CoDet is a novel approach to learning object-level vision-language representations for open-vocabulary object detection.
By grouping images whose captions mention a shared concept, objects corresponding to that concept should exhibit high co-occurrence across the group (a toy sketch of this grouping idea follows this list).
CoDet achieves superior performance and compelling scalability in open-vocabulary detection.
arXiv Detail & Related papers (2023-10-25T14:31:02Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework that learns directly from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets, respectively.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
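As referenced in the CoDet entry above, its co-occurrence grouping premise can be illustrated with a toy snippet. This is a hypothetical sketch of the grouping step only, not CoDet's actual algorithm: image-caption pairs whose captions mention the same concept word are pooled, on the assumption that regions depicting that concept recur across the group. The concept list and captions below are made up.

```python
# Toy illustration of grouping image-caption pairs by a shared caption
# concept (hypothetical sketch; not the CoDet implementation).
from collections import defaultdict

CONCEPTS = {"dog", "cat", "kite"}  # assumed open-vocabulary concept list

def group_by_concept(samples):
    """samples: list of (image_id, caption) -> {concept: [image_ids]}."""
    groups = defaultdict(list)
    for image_id, caption in samples:
        for word in caption.lower().split():
            if word in CONCEPTS:
                groups[word].append(image_id)
    return groups

samples = [(0, "a dog chasing a kite"),
           (1, "a sleeping dog"),
           (2, "a cat on a sofa")]
print(dict(group_by_concept(samples)))
# {'dog': [0, 1], 'kite': [0], 'cat': [2]}
```

Within each such group, a detector could then look for regions whose features co-occur across the grouped images to localize the shared concept.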