Towards Open-Ended Visual Recognition with Large Language Model
- URL: http://arxiv.org/abs/2311.08400v1
- Date: Tue, 14 Nov 2023 18:59:01 GMT
- Title: Towards Open-Ended Visual Recognition with Large Language Model
- Authors: Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
- Abstract summary: We introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier.
OSM predicts class labels in a generative manner, removing the need to supply class names during both training and testing.
It also enables cross-dataset training without any human intervention.
- Score: 27.56182473356992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing and recognizing objects in the open-ended physical world poses a
long-standing challenge within the domain of machine perception. Recent methods
have endeavored to address the issue by employing a class-agnostic mask (or
box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP)
using pre-extracted text embeddings. However, it is worth noting that these
open-vocabulary recognition models still exhibit limitations in practical
applications. On one hand, they rely on the provision of class names during
testing, where the recognition performance heavily depends on this predefined
set of semantic classes by users. On the other hand, when training with
multiple datasets, human intervention is required to alleviate the label
definition conflict between them. In this paper, we introduce the OmniScient
Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a
straightforward and effective solution to the aforementioned challenges.
Specifically, OSM predicts class labels in a generative manner, thus removing
the need to supply class names during both training and testing. It also
enables cross-dataset training without any human intervention, exhibiting
robust generalization capabilities due to the world knowledge acquired from
the LLM.
By combining OSM with an off-the-shelf mask proposal model, we present
promising results on various benchmarks, and demonstrate its effectiveness in
handling novel concepts. Code and models are available at
https://github.com/bytedance/OmniScient-Model.
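The abstract above describes a two-stage pipeline: an off-the-shelf class-agnostic mask proposal model, followed by an LLM-based classifier that generates a free-form label for each mask instead of scoring a predefined class list. The sketch below is a minimal, hypothetical rendering of that interface; `MaskProposer`, `VisionLanguageLLM`, and the prompt wording are illustrative placeholders rather than the released OSM API (see the repository linked above for the actual model). A companion sketch of the conventional CLIP text-embedding classifier that the abstract contrasts against appears after the related-papers list.
```python
# Hypothetical sketch of open-ended, generative mask classification.
# MaskProposer and VisionLanguageLLM are placeholders for an off-the-shelf
# class-agnostic proposal model and an instruction-tuned multimodal LLM;
# they are not the released OmniScient-Model code.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MaskProposal:
    mask: np.ndarray   # HxW boolean mask, produced without any class labels
    score: float       # objectness / mask-quality score


class MaskProposer:
    """Placeholder for a class-agnostic mask (or box) proposal model."""

    def __call__(self, image: np.ndarray) -> List[MaskProposal]:
        raise NotImplementedError


class VisionLanguageLLM:
    """Placeholder for an LLM-based classifier that *generates* a label,
    so no candidate class list is needed at training or test time."""

    def generate(self, image: np.ndarray, mask: np.ndarray, prompt: str) -> str:
        raise NotImplementedError


def open_ended_recognition(image: np.ndarray,
                           proposer: MaskProposer,
                           llm: VisionLanguageLLM) -> List[dict]:
    """Propose class-agnostic masks, then ask the LLM to name each region."""
    results = []
    for proposal in proposer(image):
        label = llm.generate(
            image=image,
            mask=proposal.mask,
            prompt="What is the object highlighted by the mask? "
                   "Answer with a short category name.",
        )
        results.append({"mask": proposal.mask,
                        "score": proposal.score,
                        "label": label.strip().lower()})
    return results
```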
Related papers
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all compared methods on average across both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method that effectively leverages the rich knowledge of vision-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming the previous state of the art on unseen classes across various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Learning to recognize occluded and small objects with partial inputs [8.460351690226817]
Masked Supervised Learning (MSL) is a single-stage, model-agnostic learning paradigm for multi-label image recognition.
We show that MSL is robust to random masking and demonstrate its effectiveness in recognizing non-masked objects.
arXiv Detail & Related papers (2023-10-27T22:29:27Z)
- Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations [86.47908754383198]
Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories.
Our method generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs.
Trained with only pseudo-masks, our method significantly improves mAP scores on the MS-COCO and OpenImages datasets.
arXiv Detail & Related papers (2023-03-29T17:58:39Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address occlusion by employing body clues provided by an extra network to distinguish the visible parts.
We propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge.
Under this condition, the occluded representation can be spontaneously aligned well within a selected subspace.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
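For contrast with the generative pipeline sketched after the abstract, the following is a minimal sketch of the conventional open-vocabulary classifier that the abstract describes and that several papers above (e.g., PosSAM, Global Knowledge Calibration) build on: each class-agnostic mask is matched against pre-extracted CLIP text embeddings of a user-supplied class list. The crop-and-embed strategy and the prompt template are simplifying assumptions, not any specific paper's implementation.
```python
# Hedged sketch: classify each class-agnostic mask by cropping its bounding
# box, embedding the crop with CLIP, and matching it against pre-extracted
# text embeddings of a fixed vocabulary supplied by the user.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def embed_vocabulary(class_names):
    """Pre-extract and L2-normalise text embeddings for the class list."""
    inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(text_emb, dim=-1)


def classify_mask(image: Image.Image, mask: np.ndarray,
                  text_emb: torch.Tensor, class_names):
    """Crop the mask's bounding box and pick the closest text embedding."""
    ys, xs = np.nonzero(mask)
    crop = image.crop((int(xs.min()), int(ys.min()),
                       int(xs.max()) + 1, int(ys.max()) + 1))
    inputs = processor(images=crop, return_tensors="pt")
    with torch.no_grad():
        img_emb = model.get_image_features(**inputs)
    img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
    scores = img_emb @ text_emb.T   # cosine similarities, shape (1, num_classes)
    return class_names[scores.argmax().item()]
```
Recognition quality here hinges entirely on the class names provided at test time, which is precisely the limitation that OSM's generative labelling is designed to remove.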