LOWA: Localize Objects in the Wild with Attributes
- URL: http://arxiv.org/abs/2305.20047v1
- Date: Wed, 31 May 2023 17:21:24 GMT
- Title: LOWA: Localize Objects in the Wild with Attributes
- Authors: Xiaoyuan Guo, Kezhen Chen, Jinmeng Rao, Yawen Zhang, Baochen Sun, Jie Yang
- Abstract summary: We present LOWA, a novel method for localizing objects with attributes effectively in the wild.
It aims to address the insufficiency of current open-vocabulary object detectors, which are limited by the lack of instance-level attribute classification and rare class names.
- Score: 8.922263691331912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present LOWA, a novel method for localizing objects with attributes
effectively in the wild. It aims to address the insufficiency of current
open-vocabulary object detectors, which are limited by the lack of
instance-level attribute classification and rare class names. To train LOWA, we
propose a hybrid vision-language training strategy to learn object detection
and recognition with class names as well as attribute information. With LOWA,
users can not only detect objects with class names, but also localize
objects by attributes. LOWA is built on top of a two-tower vision-language
architecture and consists of a standard vision transformer as the image encoder
and a similar transformer as the text encoder. To learn the alignment between
visual and text inputs at the instance level, we train LOWA with three training
steps: object-level training, attribute-aware learning, and free-text joint
training of objects and attributes. This hybrid training strategy first ensures
correct object detection, then incorporates instance-level attribute
information, and finally balances the object class and attribute sensitivity.
We evaluate our model's performance on attribute classification and attribute
localization on the Open-Vocabulary Attribute Detection (OVAD) benchmark and
the Visual Attributes in the Wild (VAW) dataset, and experiments indicate
strong zero-shot performance. Ablation studies additionally demonstrate the
effectiveness of each training step of our approach.
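The instance-level alignment described in the abstract can be illustrated with a small sketch. This is not the paper's code: the two encoder towers are reduced to placeholder projections, and all names and dimensions are assumptions.

```python
# Minimal sketch of two-tower, instance-level vision-language alignment.
# The real model uses a ViT image encoder and a transformer text encoder;
# here both are reduced to placeholder linear projections (assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerAlignment(nn.Module):
    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(feat_dim, embed_dim)  # stands in for the ViT tower
        self.text_proj = nn.Linear(feat_dim, embed_dim)   # stands in for the text tower
        self.box_head = nn.Linear(embed_dim, 4)           # per-instance box regression
        self.logit_scale = nn.Parameter(torch.tensor(2.66))  # CLIP-style temperature

    def forward(self, instance_feats, prompt_feats):
        # instance_feats: (N, feat_dim) pooled features, one row per detected instance
        # prompt_feats:   (P, feat_dim) encoded class-name or attribute prompts
        inst = self.image_proj(instance_feats)
        img = F.normalize(inst, dim=-1)
        txt = F.normalize(self.text_proj(prompt_feats), dim=-1)
        # Instance-level alignment: each instance is scored against every prompt.
        logits = self.logit_scale.exp() * img @ txt.t()   # (N, P)
        boxes = self.box_head(inst)                       # (N, 4)
        return logits, boxes
```

Under this reading, the three training steps would mostly change what flows through `prompt_feats`: class names only, then attribute-augmented phrases, then free-text descriptions mixing objects and attributes. That staging is an interpretation of the abstract, not the paper's exact recipe.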
Related papers
- Adaptive Prototype Model for Attribute-based Multi-label Few-shot Action Recognition [11.316708754749103]
In real-world action recognition systems, incorporating more attributes helps achieve a more comprehensive understanding of human behavior.
We propose a novel method, the Adaptive Attribute Prototype Model (AAPM), for human action recognition, which captures rich action-relevant attribute information.
Our AAPM achieves state-of-the-art performance in both attribute-based multi-label few-shot action recognition and single-label few-shot action recognition.
arXiv Detail & Related papers (2025-02-18T06:39:28Z)
- Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning [83.10178754323955]
Hybrid Discriminative Attribute-Object Embedding (HDA-OE) network is proposed to solve the problem of complex interactions between attributes and object visual representations.
To increase the variability of training data, HDA-OE introduces an attribute-driven data synthesis (ADDS) module.
To further improve the discriminative ability of the model, HDA-OE introduces the subclass-driven discriminative embedding (SDDE) module.
The proposed model has been evaluated on three benchmark datasets, and the results verify its effectiveness and reliability.
arXiv Detail & Related papers (2024-11-28T09:50:25Z)
- Attribute Localization and Revision Network for Zero-Shot Learning [13.530912616208722]
Zero-shot learning enables the model to recognize unseen categories with the aid of auxiliary semantic information such as attributes.
In this paper, we find that the choice between local and global features is not a zero-sum game; global features can also contribute to the understanding of attributes.
arXiv Detail & Related papers (2023-10-11T14:50:52Z)
- Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model how attributes interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
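As a rough sketch of that idea (all module names and dimensions below are illustrative, not from the paper), a conditional attribute embedding can be generated from the attribute's base embedding together with the object embedding and the image feature:

```python
import torch
import torch.nn as nn

class ConditionalAttributeEmbedding(nn.Module):
    """Sketch: attribute embeddings conditioned on object and image."""
    def __init__(self, num_attrs, num_objs, dim=300):
        super().__init__()
        self.attr_base = nn.Embedding(num_attrs, dim)   # e.g. "wet"
        self.obj_embed = nn.Embedding(num_objs, dim)    # e.g. "apple", "cat"
        # Fuse base attribute, object, and image evidence into one embedding.
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, attr_ids, obj_ids, img_feat):
        # img_feat: (batch, dim) pooled visual feature
        a = self.attr_base(attr_ids)   # (batch, dim)
        o = self.obj_embed(obj_ids)    # (batch, dim)
        # "wet" conditioned on "apple" vs. "cat" yields different embeddings.
        return self.fuse(torch.cat([a, o, img_feat], dim=-1))
```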
arXiv Detail & Related papers (2023-05-29T08:04:05Z)
- Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection [33.77415850289717]
Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context.
It is unclear how methods use this context in learning, or whether models succeed when tasks require both attribute and object understanding.
Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective.
arXiv Detail & Related papers (2023-03-17T16:14:37Z)
- OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
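A minimal sketch of the second stage described above, assuming CLIP-style embeddings for proposal crops and prompts (the function and its signature are illustrative, not OvarNet's API):

```python
import torch
import torch.nn.functional as F

def classify_proposals(crop_feats, class_txt, attr_txt, scale=100.0):
    """Stage 2 of a two-stage open-vocabulary pipeline (sketch).

    crop_feats: (P, D) embeddings of P proposal crops from a frozen
                image encoder (proposals come from an offline RPN).
    class_txt:  (C, D) text embeddings of category prompts.
    attr_txt:   (A, D) text embeddings of attribute prompts.
    """
    v = F.normalize(crop_feats, dim=-1)
    # Categories: softmax over classes (one label per box).
    cls_scores = (scale * v @ F.normalize(class_txt, dim=-1).t()).softmax(-1)
    # Attributes: independent sigmoids (a box can carry many attributes).
    attr_scores = (scale * v @ F.normalize(attr_txt, dim=-1).t()).sigmoid()
    return cls_scores, attr_scores
```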
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
- Label2Label: A Language Modeling Framework for Multi-Attribute Learning [93.68058298766739]
Label2Label is the first attempt at multi-attribute prediction from the perspective of language modeling.
Inspired by the success of pre-training language models in NLP, Label2Label introduces an image-conditioned masked language model.
Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints.
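That intuition can be sketched as an image-conditioned masked model over attribute tokens; everything below (module names, the mask-token convention, dimensions) is an assumption for illustration:

```python
import torch
import torch.nn as nn

class MaskedAttributeModel(nn.Module):
    """Sketch: predict masked attribute tokens from image + attribute context."""
    def __init__(self, num_attrs, dim=256, nhead=4, nlayers=2):
        super().__init__()
        # Tokens: one per attribute value, plus a [MASK] token (last index).
        self.tok = nn.Embedding(num_attrs + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(dim, num_attrs)

    def forward(self, attr_tokens, img_feat):
        # attr_tokens: (B, L) attribute ids, some replaced by the mask id
        # img_feat:    (B, dim) image feature prepended as a conditioning token
        x = torch.cat([img_feat.unsqueeze(1), self.tok(attr_tokens)], dim=1)
        h = self.encoder(x)
        # Predict attribute values at every position (loss only on masked ones).
        return self.head(h[:, 1:])
```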
arXiv Detail & Related papers (2022-07-18T15:12:33Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
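A hedged sketch of the prototype idea, assuming learnable per-attribute prototypes correlated with a spatial feature map (names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class AttributePrototypes(nn.Module):
    """Sketch: localize attributes by correlating local features with prototypes."""
    def __init__(self, num_attrs, channels=512):
        super().__init__()
        # One learnable prototype vector per attribute.
        self.prototypes = nn.Parameter(torch.randn(num_attrs, channels))

    def forward(self, feat_map):
        # feat_map: (B, C, H, W) local features from the backbone
        sim = torch.einsum("bchw,ac->bahw", feat_map, self.prototypes)
        # Max over spatial positions scores each attribute; the argmax
        # location serves as the attribute's visual evidence in the image.
        scores = sim.flatten(2).max(dim=2).values   # (B, num_attrs)
        return scores, sim
```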
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called the hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, constructed from both bottom-up and top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
- Attribute Prototype Network for Zero-Shot Learning [113.50220968583353]
We propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features.
Our model points to the visual evidence of the attributes in an image, confirming the improved attribute localization ability of our image representation.
arXiv Detail & Related papers (2020-08-19T06:46:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.