Zero-Shot In-Distribution Detection in Multi-Object Settings Using
Vision-Language Foundation Models
- URL: http://arxiv.org/abs/2304.04521v3
- Date: Wed, 23 Aug 2023 13:11:20 GMT
- Title: Zero-Shot In-Distribution Detection in Multi-Object Settings Using
Vision-Language Foundation Models
- Authors: Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
- Abstract summary: In this paper, we propose a novel problem setting called zero-shot in-distribution (ID) detection.
We identify images containing ID objects as ID images (even if they contain OOD objects) and images lacking ID objects as OOD images without any training.
We present a simple and effective approach, Global-Local Maximum Concept Matching (GL-MCM), based on both global and local visual-text alignments of CLIP features.
- Score: 37.36999826208225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting in-distribution (ID) images from noisy images scraped from the
Internet is an important preprocessing step for constructing datasets, which has
traditionally been done manually. Automating this preprocessing with deep
learning techniques presents two key challenges. First, images should be
collected using only the name of the ID class without training on the ID data.
Second, as the design of multi-object datasets such as COCO illustrates, it is
crucial to identify images containing not only ID objects but also both ID and
out-of-distribution (OOD) objects as ID images in order to build robust
recognizers. In this paper, we propose a
novel problem setting called zero-shot in-distribution (ID) detection, where we
identify images containing ID objects as ID images (even if they contain OOD
objects), and images lacking ID objects as OOD images without any training. To
solve this problem, we leverage the powerful zero-shot capability of CLIP and
present a simple and effective approach, Global-Local Maximum Concept Matching
(GL-MCM), based on both global and local visual-text alignments of CLIP
features. Extensive experiments demonstrate that GL-MCM outperforms comparison
methods on both multi-object datasets and single-object ImageNet benchmarks.
The code is available at https://github.com/AtsuMiyai/GL-MCM.
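Reading the abstract, the GL-MCM score appears to combine the standard Maximum Concept Matching (MCM) score computed on CLIP's global image embedding with an analogous maximum taken over per-location features. Below is a minimal sketch of that scoring rule under our own assumptions: the function names, the temperature value, and the choice of local features are illustrative, not the paper's exact implementation (see the repository above for the official code).

```python
# Minimal sketch of GL-MCM-style scoring, assuming precomputed, L2-normalized
# CLIP features. Names and the temperature are illustrative assumptions; the
# official implementation is at https://github.com/AtsuMiyai/GL-MCM.
import torch
import torch.nn.functional as F

def mcm_score(global_feat: torch.Tensor, text_feats: torch.Tensor,
              tau: float = 0.01) -> torch.Tensor:
    """Maximum Concept Matching: max softmax over scaled cosine similarities.

    global_feat: (D,) normalized global CLIP image embedding.
    text_feats:  (K, D) normalized embeddings of the K ID-class prompts.
    """
    sims = text_feats @ global_feat               # (K,) cosine similarities
    return F.softmax(sims / tau, dim=0).max()

def gl_mcm_score(global_feat: torch.Tensor, local_feats: torch.Tensor,
                 text_feats: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Global-Local MCM: global MCM plus the best local visual-text match.

    local_feats: (N, D) normalized per-location CLIP features, e.g. the
    projected value embeddings of the final attention layer.
    """
    # Softmax over classes at every location, then take the single strongest
    # (location, class) response anywhere in the image.
    local_probs = F.softmax(local_feats @ text_feats.T / tau, dim=-1)  # (N, K)
    return mcm_score(global_feat, text_feats, tau) + local_probs.max()
```

An image would then be labeled ID when this score clears a threshold. Intuitively, the local term is what lets an image containing both ID and OOD objects still be detected as ID, since a single strong local alignment with an ID concept suffices.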
Related papers
- A Generative Approach for Wikipedia-Scale Visual Entity Recognition [56.55633052479446]
We address the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.
We introduce a novel Generative Entity Recognition framework, which learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity.
arXiv Detail & Related papers (2024-03-04T13:47:30Z)
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With this simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Detector Guidance for Multi-Object Text-to-Image Generation [61.70018793720616]
Detector Guidance (DG) integrates a latent object detection model to separate different objects during the generation process.
Human evaluations demonstrate that DG provides an 8-22% advantage in preventing the amalgamation of conflicting concepts.
arXiv Detail & Related papers (2023-06-04T02:33:12Z)
- CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate a better visual representation in image re-identification tasks.
The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID.
The effectiveness of the proposed strategy is validated on several datasets for person and vehicle ReID tasks; a sketch of such per-ID prompt tokens follows this entry.
arXiv Detail & Related papers (2022-11-25T09:41:57Z)
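As a rough illustration of the learnable per-ID text tokens mentioned in the CLIP-ReID summary above, the module below holds one bank of token embeddings per identity; the class and parameter names, shapes, and initialization are our assumptions, not the authors' code.

```python
# Illustrative sketch of per-identity learnable prompt tokens in the spirit
# of the CLIP-ReID summary; all names and shapes here are assumptions.
import torch
import torch.nn as nn

class IDPromptTokens(nn.Module):
    """A bank of M learnable token embeddings per identity, spliced into a
    fixed template such as "A photo of a [X]_1 ... [X]_M person" before
    running CLIP's text encoder."""

    def __init__(self, num_ids: int, num_tokens: int = 4, embed_dim: int = 512):
        super().__init__()
        self.tokens = nn.Parameter(
            torch.empty(num_ids, num_tokens, embed_dim).normal_(std=0.02))

    def forward(self, id_labels: torch.Tensor) -> torch.Tensor:
        # (B,) integer identity labels -> (B, M, D) token embeddings that the
        # caller inserts into the tokenized prompt template.
        return self.tokens[id_labels]
```

In the two-stage strategy the summary describes, such tokens would first be optimized with both CLIP encoders frozen, and the resulting text features would then guide fine-tuning of the image encoder.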
- Tasks Integrated Networks: Joint Detection and Retrieval for Image Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z)
- A Fast Fully Octave Convolutional Neural Network for Document Image Segmentation [1.8426817621478804]
We investigate a method based on U-Net to detect the document edges and text regions in ID (identity-document) images.
We propose a model optimization based on Octave Convolutions to adapt the method to situations where storage, processing, and time resources are limited; a sketch of the octave-convolution building block follows this list.
Our results show that the proposed models are efficient and portable for document segmentation tasks.
arXiv Detail & Related papers (2020-04-03T00:57:33Z)
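For reference, the sketch below shows the octave-convolution building block (Chen et al., 2019) that the document-segmentation entry above relies on; the split ratio alpha, the kernel size, and the layer names are illustrative assumptions.

```python
# Minimal sketch of an octave convolution layer; alpha, kernel size, and
# layer names are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    """Splits channels into a high-frequency group at full resolution and a
    low-frequency group stored at half resolution, cutting memory and FLOPs."""

    def __init__(self, in_ch: int, out_ch: int, alpha: float = 0.5, k: int = 3):
        super().__init__()
        in_lo, out_lo = int(alpha * in_ch), int(alpha * out_ch)
        in_hi, out_hi = in_ch - in_lo, out_ch - out_lo
        p = k // 2
        self.hh = nn.Conv2d(in_hi, out_hi, k, padding=p)  # high -> high
        self.hl = nn.Conv2d(in_hi, out_lo, k, padding=p)  # high -> low
        self.lh = nn.Conv2d(in_lo, out_hi, k, padding=p)  # low  -> high
        self.ll = nn.Conv2d(in_lo, out_lo, k, padding=p)  # low  -> low

    def forward(self, x_hi: torch.Tensor, x_lo: torch.Tensor):
        # High-frequency output: same-scale conv plus upsampled low-to-high.
        y_hi = self.hh(x_hi) + F.interpolate(self.lh(x_lo), scale_factor=2)
        # Low-frequency output: low-to-low conv plus downsampled high-to-low.
        y_lo = self.ll(x_lo) + self.hl(F.avg_pool2d(x_hi, 2))
        return y_hi, y_lo
```

Because the low-frequency half of the channels lives at half the spatial resolution, a sizable fraction of feature-map memory and convolution FLOPs is saved, which is the property the paper exploits on resource-limited devices.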
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.