Zero-Shot In-Distribution Detection in Multi-Object Settings Using
Vision-Language Foundation Models
- URL: http://arxiv.org/abs/2304.04521v3
- Date: Wed, 23 Aug 2023 13:11:20 GMT
- Title: Zero-Shot In-Distribution Detection in Multi-Object Settings Using
Vision-Language Foundation Models
- Authors: Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
- Abstract summary: In this paper, we propose a novel problem setting called zero-shot in-distribution (ID) detection.
We identify images containing ID objects as ID images (even if they contain OOD objects) and images lacking ID objects as OOD images without any training.
We present a simple and effective approach, Global-Local Maximum Concept Matching (GL-MCM), based on both global and local visual-text alignments of CLIP features.
- Score: 37.36999826208225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting in-distribution (ID) images from noisy images scraped from the
Internet is an important preprocessing step for constructing datasets, one that has
traditionally been done manually. Automating this preprocessing with deep
learning techniques presents two key challenges. First, images should be
collected using only the name of the ID class without training on the ID data.
Second, as the motivation behind COCO illustrates, it is crucial to identify not only
images containing solely ID objects but also images containing both ID and
out-of-distribution (OOD) objects as ID images, in order to build robust
recognizers. In this paper, we propose a
novel problem setting called zero-shot in-distribution (ID) detection, where we
identify images containing ID objects as ID images (even if they contain OOD
objects), and images lacking ID objects as OOD images without any training. To
solve this problem, we leverage the powerful zero-shot capability of CLIP and
present a simple and effective approach, Global-Local Maximum Concept Matching
(GL-MCM), based on both global and local visual-text alignments of CLIP
features. Extensive experiments demonstrate that GL-MCM outperforms comparison
methods on both multi-object datasets and single-object ImageNet benchmarks.
The code will be available via https://github.com/AtsuMiyai/GL-MCM.
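The abstract states that GL-MCM scores an image by combining global and local visual-text alignments of CLIP features. The following is a minimal sketch of that scoring rule, written against precomputed CLIP embeddings; the helper shapes, the temperature value, and the way the local (patch-level) features are obtained are illustrative assumptions rather than the authors' exact implementation (see the repository above for that).

import torch
import torch.nn.functional as F

def gl_mcm_score(global_feat: torch.Tensor,   # (D,)   pooled CLIP image embedding
                 local_feats: torch.Tensor,   # (P, D) patch-level CLIP image embeddings
                 text_feats: torch.Tensor,    # (C, D) CLIP text embeddings of the ID class prompts
                 temperature: float = 0.01) -> torch.Tensor:
    """Return a scalar ID-ness score; higher means more likely in-distribution."""
    # L2-normalize so dot products become cosine similarities.
    global_feat = F.normalize(global_feat, dim=-1)
    local_feats = F.normalize(local_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Global maximum concept matching: softmax over ID classes of the
    # global image-text similarities, then take the top class probability.
    global_probs = (global_feat @ text_feats.t() / temperature).softmax(dim=-1)  # (C,)
    global_score = global_probs.max()

    # Local maximum concept matching: per-patch softmax over ID classes,
    # then the maximum over both patches and classes, so a single ID object
    # anywhere in a multi-object image is enough to raise the score.
    local_probs = (local_feats @ text_feats.t() / temperature).softmax(dim=-1)   # (P, C)
    local_score = local_probs.max()

    return global_score + local_score

In use, an image whose combined score falls below a chosen threshold would be flagged as OOD; the max over patches in the local term is what lets an image containing both ID and OOD objects still be kept as ID, which is the multi-object behavior the paper targets.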
Related papers
- MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing [12.491684385808902]
MMO-IG is designed to generate RS images with supervised object labels from global and local aspects simultaneously.
Considering the complex interdependencies among MMOs, we construct a spatial-cross dependency knowledge graph.
Our MMO-IG exhibits superior generation capabilities for RS images with dense MMO-supervised labels.
arXiv Detail & Related papers (2024-12-18T10:19:12Z) - EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM [38.8308841469793]
This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt.
We leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM) to exploit consistent visual elements within multiple images.
Experimental results demonstrate EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
arXiv Detail & Related papers (2024-12-12T18:59:48Z) - OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Geometric and Semantic Guidances [11.085165252259042]
OSMLoc is a brain-inspired single-image visual localization method with semantic and geometric guidance to improve accuracy, robustness, and generalization ability.
To validate the proposed OSMLoc, we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation.
arXiv Detail & Related papers (2024-11-13T14:59:00Z) - Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then use the first adapter to adjust these tokens and LoRA in pre-trained LLMs to fine-tune their weights.
Finally, for the alignment of tokens, we utilize four further adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z) - INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - From Global to Local: Multi-scale Out-of-distribution Detection [129.37607313927458]
Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process.
Recent progress in representation learning gives rise to distance-based OOD detection.
We propose Multi-scale OOD DEtection (MODE), the first framework that leverages both global visual information and local region details.
arXiv Detail & Related papers (2023-08-20T11:56:25Z) - Coarse-to-Fine: Learning Compact Discriminative Representation for
Single-Stage Image Retrieval [11.696941841000985]
Two-stage methods following the retrieve-and-rerank paradigm have achieved excellent performance, but their separate local and global modules are inefficient for real-world applications.
We propose a mechanism that attentively selects prominent local descriptors and infuses fine-grained semantic relations into the global representation.
Our method achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris.
arXiv Detail & Related papers (2023-08-08T03:06:10Z) - Adaptive Graph Convolution Module for Salient Object Detection [7.278033100480174]
We propose an adaptive graph convolution module (AGCM) to deal with complex scenes.
Prototype features are extracted from the input image using a learnable region generation layer.
The proposed AGCM dramatically improves SOD performance both quantitatively and qualitatively.
arXiv Detail & Related papers (2023-03-17T07:07:17Z) - Multi-Content Complementation Network for Salient Object Detection in
Optical Remote Sensing Images [108.79667788962425]
Salient object detection in optical remote sensing images (RSI-SOD) remains a challenging and emerging topic.
We propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple types of content for RSI-SOD.
In its multi-content complementation modules (MCCM), we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features.
arXiv Detail & Related papers (2021-12-02T04:46:40Z) - Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
We propose a TRansformer-based Few-shot Semantic segmentation method (TRFS).
Our model consists of two modules: a Global Enhancement Module (GEM) and a Local Enhancement Module (LEM).
arXiv Detail & Related papers (2021-08-04T20:09:21Z)