OvarNet: Towards Open-vocabulary Object Attribute Recognition
- URL: http://arxiv.org/abs/2301.09506v1
- Date: Mon, 23 Jan 2023 15:59:29 GMT
- Title: OvarNet: Towards Open-vocabulary Object Attribute Recognition
- Authors: Keyan Chen, Xiaolong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen,
Weidi Xie
- Abstract summary: We start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
- Score: 42.90477523238336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider the problem of simultaneously detecting objects
and inferring their visual attributes in an image, even for those with no
manual annotations provided at the training stage, resembling an
open-vocabulary scenario. To achieve this goal, we make the following
contributions: (i) we start with a naive two-stage approach for open-vocabulary
object detection and attribute classification, termed CLIP-Attr. The candidate
objects are first proposed with an offline RPN and later classified for
semantic category and attributes; (ii) we combine all available datasets and
train with a federated strategy to finetune the CLIP model, aligning the visual
representation with attributes; additionally, we investigate the efficacy of
leveraging freely available online image-caption pairs under weakly supervised
learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type model
end-to-end with knowledge distillation that performs class-agnostic object
proposals and classification on semantic categories and attributes with
classifiers generated from a text encoder; finally, (iv) we conduct extensive
experiments on VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition
of semantic category and attributes is complementary for visual scene
understanding, i.e., jointly training object detection and attribute
prediction largely outperforms existing approaches that treat the two tasks
independently, demonstrating strong generalization ability to novel attributes
and categories.
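The central mechanism in contributions (i) and (iii), scoring region proposals against classifiers produced by a text encoder, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the softmax/sigmoid split, the temperature, and the random features are assumptions, standing in for CLIP image embeddings of RPN crops and CLIP text embeddings of category/attribute prompts.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats, category_embeds, attribute_embeds, temperature=0.07):
    """Open-vocabulary scoring of region proposals.

    region_feats:     (R, D) image embeddings of R candidate boxes (e.g. CLIP crops)
    category_embeds:  (C, D) text embeddings of category prompts
    attribute_embeds: (A, D) text embeddings of attribute prompts
    """
    region_feats = F.normalize(region_feats, dim=-1)
    category_embeds = F.normalize(category_embeds, dim=-1)
    attribute_embeds = F.normalize(attribute_embeds, dim=-1)

    # Categories are mutually exclusive: softmax over cosine similarities.
    cat_probs = (region_feats @ category_embeds.T / temperature).softmax(dim=-1)
    # Attributes can co-occur: independent sigmoid per attribute.
    attr_probs = (region_feats @ attribute_embeds.T / temperature).sigmoid()
    return cat_probs, attr_probs

# Toy usage with random features; D=512 matches common CLIP variants.
R, C, A, D = 4, 10, 20, 512
cat_p, attr_p = classify_regions(torch.randn(R, D), torch.randn(C, D), torch.randn(A, D))
print(cat_p.shape, attr_p.shape)  # torch.Size([4, 10]) torch.Size([4, 20])
```

Because the classifiers are simply text embeddings, new categories and attributes can be added at inference time by encoding new prompts, which is what makes the setup open-vocabulary.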
Related papers
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling [32.55352435358949]
We propose a sentence generation-based retrieval formulation for attribute recognition.
For each attribute to be recognized in an image, we measure the image-conditioned probability of generating a short sentence.
We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
arXiv Detail & Related papers (2024-08-07T21:44:29Z)
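As a rough illustration of the generative-retrieval idea in the ArtVLM entry above: each attribute is scored by the image-conditioned log-probability of a short sentence mentioning it. The scoring interface and prompt template below are assumptions for the sketch, not ArtVLM's actual code.

```python
from typing import Callable, List, Sequence, Tuple

def rank_attributes_by_generation(
    image: object,
    attributes: Sequence[str],
    sentence_log_prob: Callable[[object, str], float],
    template: str = "the object is {}",
) -> List[Tuple[str, float]]:
    """Score each attribute by the log-probability of generating a short
    sentence about it, conditioned on the image, then rank (generative retrieval)."""
    scores = [(a, sentence_log_prob(image, template.format(a))) for a in attributes]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)

# Toy usage: a real scorer would sum a captioning VLM's token log-probabilities
# for the sentence given the image; here a dummy placeholder is used.
ranked = rank_attributes_by_generation(
    image=None,
    attributes=["striped", "metallic", "wooden"],
    sentence_log_prob=lambda img, s: -float(len(s)),
)
print(ranked)
```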
- Multi-modal Attribute Prompting for Vision-Language Models [40.39559705414497]
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios.
Existing prompting techniques focus primarily on global text and image representations, while overlooking multi-modal attribute characteristics.
We propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment.
arXiv Detail & Related papers (2024-03-01T01:28:10Z)
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing models for target sentiment classification (TSC) based on pre-trained language models (PTLMs) fall into two groups: 1) fine-tuning-based models that adopt the PTLM as the context encoder; 2) prompting-based models that transfer the classification task to a text/word generation task.
We present a new perspective on leveraging PTLMs for TSC: simultaneously exploiting the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence, where external knowledge is usually incorporated when recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying the novel categories: via language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
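To make the multi-modal classifier idea from the entry above concrete, here is a small sketch assuming CLIP-style embeddings are already computed; the simple mean fusion is an illustrative assumption, not the paper's exact fusion strategy.

```python
import torch
import torch.nn.functional as F

def build_classifier(text_embeds=None, exemplar_embeds=None):
    """Build one classifier vector for a novel category from language
    descriptions, image exemplars, or both."""
    parts = []
    if text_embeds is not None:       # (T, D) embeddings of textual descriptions
        parts.append(F.normalize(text_embeds, dim=-1).mean(dim=0))
    if exemplar_embeds is not None:   # (K, D) embeddings of image exemplars
        parts.append(F.normalize(exemplar_embeds, dim=-1).mean(dim=0))
    if not parts:
        raise ValueError("need at least one modality")
    return F.normalize(torch.stack(parts).mean(dim=0), dim=-1)

# Toy usage: fuse 3 descriptions with 5 exemplars into one detection classifier.
D = 512
w = build_classifier(text_embeds=torch.randn(3, D), exemplar_embeds=torch.randn(5, D))
print(w.shape)  # torch.Size([512])
```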
- The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z)
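The classifier-initialization trick from the HOI entry above can be sketched as follows; the linear-head shape and the normalization are assumptions made for the illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_head_from_text(class_text_embeds: torch.Tensor) -> nn.Linear:
    """Initialize a classification head with language embeddings of the class
    names, so semantically related classes start with correlated weights."""
    num_classes, dim = class_text_embeds.shape
    head = nn.Linear(dim, num_classes, bias=False)
    with torch.no_grad():
        head.weight.copy_(F.normalize(class_text_embeds, dim=-1))
    return head

# Toy usage, e.g. 600 human-object interaction classes with 512-d text embeddings.
head = init_head_from_text(torch.randn(600, 512))
print(head(torch.randn(2, 512)).shape)  # torch.Size([2, 600])
```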
- Adaptive Prototypical Networks with Label Words and Joint Representation Learning for Few-Shot Relation Classification [17.237331828747006]
This work focuses on few-shot relation classification (FSRC).
We propose an adaptive mixture mechanism to add label words to the representation of the class prototype.
Experiments have been conducted on FewRel under different few-shot (FS) settings.
arXiv Detail & Related papers (2021-01-10T11:25:42Z)
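A minimal sketch of mixing label-word embeddings into class prototypes, in the spirit of the entry above; the fixed gate used here is an assumption, whereas the paper learns an adaptive mixture.

```python
import torch
import torch.nn.functional as F

def mixed_prototype(support_embeds: torch.Tensor,
                    label_word_embed: torch.Tensor,
                    gate: float = 0.5) -> torch.Tensor:
    """Class prototype = convex mix of the mean support embedding and the
    embedding of the class's label words (gate would be learned/adaptive)."""
    proto = support_embeds.mean(dim=0)
    return gate * proto + (1.0 - gate) * label_word_embed

def classify(query: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Nearest-prototype classification by cosine similarity."""
    return F.normalize(query, dim=-1) @ F.normalize(prototypes, dim=-1).T

# Toy 5-way 3-shot usage with 768-d embeddings.
D = 768
protos = torch.stack([mixed_prototype(torch.randn(3, D), torch.randn(D)) for _ in range(5)])
print(classify(torch.randn(2, D), protos).shape)  # torch.Size([2, 5])
```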
- Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition [27.0842107128122]
We devise an attributes-guided attention module (AGAM) to utilize human-annotated attributes and learn more discriminative features.
Our proposed module can significantly improve simple metric-based approaches to achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-09-10T08:38:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.