LOWA: Localize Objects in the Wild with Attributes
- URL: http://arxiv.org/abs/2305.20047v1
- Date: Wed, 31 May 2023 17:21:24 GMT
- Title: LOWA: Localize Objects in the Wild with Attributes
- Authors: Xiaoyuan Guo, Kezhen Chen, Jinmeng Rao, Yawen Zhang, Baochen Sun, Jie
Yang
- Abstract summary: We present LOWA, a novel method for localizing objects with attributes effectively in the wild.
It aims to address the insufficiency of current open-vocabulary object detectors, which are limited by the lack of instance-level attribute classification and rare class names.
- Score: 8.922263691331912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present LOWA, a novel method for localizing objects with attributes
effectively in the wild. It aims to address the insufficiency of current
open-vocabulary object detectors, which are limited by the lack of
instance-level attribute classification and rare class names. To train LOWA, we
propose a hybrid vision-language training strategy to learn object detection
and recognition with class names as well as attribute information. With LOWA,
users can not only detect objects with class names, but also localize
objects by attributes. LOWA is built on top of a two-tower vision-language
architecture and consists of a standard vision transformer as the image encoder
and a similar transformer as the text encoder. To learn the alignment between
visual and text inputs at the instance level, we train LOWA with three training
steps: object-level training, attribute-aware learning, and free-text joint
training of objects and attributes. This hybrid training strategy first ensures
correct object detection, then incorporates instance-level attribute
information, and finally balances the object class and attribute sensitivity.
We evaluate our model's performance on attribute classification and attribute
localization on the Open-Vocabulary Attribute Detection (OVAD) benchmark and
the Visual Attributes in the Wild (VAW) dataset, and experiments indicate
strong zero-shot performance. Ablation studies additionally demonstrate the
effectiveness of each training step of our approach.
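To make the described two-tower design concrete, here is a minimal PyTorch sketch of such a detector with instance-level image-text alignment. It is an illustration under our own assumptions only: the module names, dimensions, pooling, and scoring function are all hypothetical, not the authors' released implementation.
```python
import torch
import torch.nn as nn

def make_tower(dim: int, heads: int, layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class TwoTowerDetector(nn.Module):
    """Hypothetical sketch in the spirit of the abstract, not LOWA's code."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 6):
        super().__init__()
        # Image tower: a standard vision transformer over patch embeddings;
        # each output token doubles as a candidate-object embedding.
        self.image_encoder = make_tower(dim, heads, layers)
        # Text tower: a similar transformer over tokenized class names,
        # attribute phrases, or free-text queries.
        self.text_encoder = make_tower(dim, heads, layers)
        self.box_head = nn.Linear(dim, 4)  # per-token box regression

    def forward(self, patch_embeds, text_embeds):
        img_tokens = self.image_encoder(patch_embeds)           # (B, P, D)
        txt_query = self.text_encoder(text_embeds).mean(dim=1)  # (B, D), pooled
        boxes = self.box_head(img_tokens)                       # (B, P, 4)
        # Instance-level alignment: score every candidate object against the
        # text query; training would attach a contrastive/matching loss here.
        scores = torch.einsum("bpd,bd->bp", img_tokens, txt_query)
        return boxes, scores

# The three training stages from the abstract would reuse this forward pass
# with different text inputs: (1) class names only, (2) class names plus
# attribute phrases, (3) free-form text mixing objects and attributes.
model = TwoTowerDetector()
patches = torch.randn(2, 196, 512)  # stand-in for ViT patch embeddings
query = torch.randn(2, 8, 512)      # stand-in for embedded query tokens
boxes, scores = model(patches, query)
print(boxes.shape, scores.shape)    # torch.Size([2, 196, 4]) torch.Size([2, 196])
```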
Related papers
- Tree of Attributes Prompt Learning for Vision-Language Models [27.64685205305313]
We propose the Tree of Attributes Prompt learning (TAP), which generates a tree of attributes with a "concept - attribute - description" structure for each category.
Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs.
Our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts.
arXiv Detail & Related papers (2024-10-15T02:37:39Z)
- Attribute Localization and Revision Network for Zero-Shot Learning [13.530912616208722]
Zero-shot learning enables the model to recognize unseen categories with the aid of auxiliary semantic information such as attributes.
In this paper, we find that the choice between local and global features is not a zero-sum game; global features can also contribute to the understanding of attributes.
arXiv Detail & Related papers (2023-10-11T14:50:52Z)
- Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes as they interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z)
- Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection [33.77415850289717]
Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context.
It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding.
Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective.
arXiv Detail & Related papers (2023-03-17T16:14:37Z)
- OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
- Label2Label: A Language Modeling Framework for Multi-Attribute Learning [93.68058298766739]
Label2Label is the first attempt at multi-attribute prediction from the perspective of language modeling.
Inspired by the success of pre-training language models in NLP, Label2Label introduces an image-conditioned masked language model.
Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints.
arXiv Detail & Related papers (2022-07-18T15:12:33Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called the hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, constructed from both bottom-up and top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
- Attribute Prototype Network for Zero-Shot Learning [113.50220968583353]
We propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features.
Our model points to the visual evidence of the attributes in an image, confirming the improved attribute localization ability of our image representation.
arXiv Detail & Related papers (2020-08-19T06:46:35Z)
- CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning [78.3857991931479]
We present GROLLA, an evaluation framework for Grounded Language Learning with Attributes.
We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations.
arXiv Detail & Related papers (2020-06-03T11:21:42Z)