LOWA: Localize Objects in the Wild with Attributes
- URL: http://arxiv.org/abs/2305.20047v1
- Date: Wed, 31 May 2023 17:21:24 GMT
- Title: LOWA: Localize Objects in the Wild with Attributes
- Authors: Xiaoyuan Guo, Kezhen Chen, Jinmeng Rao, Yawen Zhang, Baochen Sun, Jie
Yang
- Abstract summary: We present LOWA, a novel method for localizing objects with attributes effectively in the wild.
It aims to address the insufficiency of current open-vocabulary object detectors, which are limited by the lack of instance-level attribute classification and rare class names.
- Score: 8.922263691331912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present LOWA, a novel method for localizing objects with attributes
effectively in the wild. It aims to address the insufficiency of current
open-vocabulary object detectors, which are limited by the lack of
instance-level attribute classification and rare class names. To train LOWA, we
propose a hybrid vision-language training strategy to learn object detection
and recognition with class names as well as attribute information. With LOWA,
users can not only detect objects with class names, but also localize
objects by attributes. LOWA is built on top of a two-tower vision-language
architecture and consists of a standard vision transformer as the image encoder
and a similar transformer as the text encoder. To learn the alignment between
visual and text inputs at the instance level, we train LOWA with three training
steps: object-level training, attribute-aware learning, and free-text joint
training of objects and attributes. This hybrid training strategy first ensures
correct object detection, then incorporates instance-level attribute
information, and finally balances the object class and attribute sensitivity.
We evaluate our model's performance on attribute classification and attribute
localization on the Open-Vocabulary Attribute Detection (OVAD) benchmark and
the Visual Attributes in the Wild (VAW) dataset, and experiments indicate
strong zero-shot performance. Ablation studies additionally demonstrate the
effectiveness of each training step of our approach.
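To make the described two-tower design concrete, here is a minimal PyTorch sketch of such a detector with instance-level image-text alignment. It is an illustration under our own assumptions only: the module names, dimensions, pooling, and scoring function are all hypothetical, not the authors' released implementation.
```python
import torch
import torch.nn as nn

def make_tower(dim: int, heads: int, layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class TwoTowerDetector(nn.Module):
    """Hypothetical sketch in the spirit of the abstract, not LOWA's code."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 6):
        super().__init__()
        # Image tower: a standard vision transformer over patch embeddings;
        # each output token doubles as a candidate-object embedding.
        self.image_encoder = make_tower(dim, heads, layers)
        # Text tower: a similar transformer over tokenized class names,
        # attribute phrases, or free-text queries.
        self.text_encoder = make_tower(dim, heads, layers)
        self.box_head = nn.Linear(dim, 4)  # per-token box regression

    def forward(self, patch_embeds, text_embeds):
        img_tokens = self.image_encoder(patch_embeds)           # (B, P, D)
        txt_query = self.text_encoder(text_embeds).mean(dim=1)  # (B, D), pooled
        boxes = self.box_head(img_tokens)                       # (B, P, 4)
        # Instance-level alignment: score every candidate object against the
        # text query; training would attach a contrastive/matching loss here.
        scores = torch.einsum("bpd,bd->bp", img_tokens, txt_query)
        return boxes, scores

# The three training stages from the abstract would reuse this forward pass
# with different text inputs: (1) class names only, (2) class names plus
# attribute phrases, (3) free-form text mixing objects and attributes.
model = TwoTowerDetector()
patches = torch.randn(2, 196, 512)  # stand-in for ViT patch embeddings
query = torch.randn(2, 8, 512)      # stand-in for embedded query tokens
boxes, scores = model(patches, query)
print(boxes.shape, scores.shape)    # torch.Size([2, 196, 4]) torch.Size([2, 196])
```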
Related papers
- Tree of Attributes Prompt Learning for Vision-Language Models [27.64685205305313]
We propose the Tree of Attributes Prompt learning (TAP), which generates a tree of attributes with a "concept - attribute - description" structure for each category.
Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs.
Our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts.
arXiv Detail & Related papers (2024-10-15T02:37:39Z)
- Attribute Localization and Revision Network for Zero-Shot Learning [13.530912616208722]
Zero-shot learning enables the model to recognize unseen categories with the aid of auxiliary semantic information such as attributes.
In this paper, we find that the choice between local and global features is not a zero-sum game; global features can also contribute to the understanding of attributes.
arXiv Detail & Related papers (2023-10-11T14:50:52Z)
- Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes as they interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z)
- Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection [33.77415850289717]
Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context.
It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding.
Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective.
arXiv Detail & Related papers (2023-03-17T16:14:37Z)
- OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
- Label2Label: A Language Modeling Framework for Multi-Attribute Learning [93.68058298766739]
Label2Label is the first attempt at multi-attribute prediction from the perspective of language modeling.
Inspired by the success of pre-training language models in NLP, Label2Label introduces an image-conditioned masked language model.
Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints.
arXiv Detail & Related papers (2022-07-18T15:12:33Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called the hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, constructed from both bottom-up and top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
- Attribute Prototype Network for Zero-Shot Learning [113.50220968583353]
We propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features.
Our model points to the visual evidence of the attributes in an image, confirming the improved attribute localization ability of our image representation.
arXiv Detail & Related papers (2020-08-19T06:46:35Z)
- CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning [78.3857991931479]
We present GROLLA, an evaluation framework for Grounded Language Learning with Attributes.
We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations.
arXiv Detail & Related papers (2020-06-03T11:21:42Z)