Learning Concise and Descriptive Attributes for Visual Recognition
- URL: http://arxiv.org/abs/2308.03685v1
- Date: Mon, 7 Aug 2023 16:00:22 GMT
- Title: Learning Concise and Descriptive Attributes for Visual Recognition
- Authors: An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley
- Abstract summary: We show that querying thousands of attributes can achieve performance competitive with image features.
We propose a novel learning-to-search method to discover those concise sets of attributes.
- Score: 25.142065847381758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in foundation models present new opportunities for
interpretable visual recognition -- one can first query Large Language Models
(LLMs) to obtain a set of attributes that describe each class, then apply
vision-language models to classify images via these attributes. Pioneering work
shows that querying thousands of attributes can achieve performance competitive
with image features. However, our further investigation on 8 datasets reveals
that large quantities of LLM-generated attributes perform almost the same as
random words. This surprising finding suggests that significant noise may be
present in these attributes. We hypothesize that there exist subsets of
attributes that can maintain the classification performance with much smaller
sizes, and propose a novel learning-to-search method to discover those concise
sets of attributes. As a result, on the CUB dataset, our method achieves
performance close to that of massive LLM-generated attributes (e.g., 10k
attributes for CUB), yet using only 32 attributes in total to distinguish 200
bird species. Furthermore, our new paradigm demonstrates several additional
benefits: higher interpretability and interactivity for humans, and the ability
to summarize knowledge for a recognition task.
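Below is a minimal PyTorch sketch of the pipeline the abstract describes, assuming OpenAI's clip package: images are scored against LLM-generated attribute texts, and a linear classifier operates on the resulting attribute scores. The learnable top-k gate stands in for the paper's learning-to-search selection; the attribute texts, sizes, and gating scheme are illustrative assumptions, not the authors' exact method.

```python
# Sketch: attribute-based classification with CLIP plus a learnable
# top-k attribute gate (illustrative stand-in for learning-to-search).
import torch
import clip  # assumes: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical attribute pool, e.g. obtained by prompting an LLM per class.
attributes = [
    "a bird with a bright red crown",
    "a bird with long, pointed wings",
    "a bird with a curved yellow beak",
]

with torch.no_grad():
    attr_feat = model.encode_text(clip.tokenize(attributes).to(device))
    attr_feat = attr_feat / attr_feat.norm(dim=-1, keepdim=True)

def attribute_scores(images):
    """Project images into the attribute space: one similarity per attribute."""
    with torch.no_grad():
        img_feat = model.encode_image(images)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ attr_feat.T).float()      # (batch, num_attributes)

num_classes, k = 200, 2                          # e.g. CUB's 200 species; tiny k here
importance = torch.nn.Parameter(torch.zeros(len(attributes), device=device))
classifier = torch.nn.Linear(len(attributes), num_classes).to(device)

def forward(images):
    scores = attribute_scores(images)
    mask = torch.sigmoid(importance)             # learned per-attribute importance
    hard = torch.zeros_like(mask).scatter(0, torch.topk(mask, k).indices, 1.0)
    gate = hard + mask - mask.detach()           # straight-through top-k selection
    return classifier(scores * gate)             # logits over classes
```

Training the gate and classifier jointly, then keeping only the k selected attributes, is one plausible way to arrive at a concise set such as the 32 attributes reported for CUB.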
Related papers
- Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition [1.2499537119440243]
We tackle zero-shot "real" classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) to classify objects based solely on descriptive attributes, excluding object class names.
We release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning.
We introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes.
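A compact sketch of this name-free setup, under the same CLIP assumption as above: each class is represented only by part descriptions, and an image is assigned to the class whose descriptions it matches best on average. The paper's multi-resolution modification is not reproduced here, and the descriptions are invented for illustration.

```python
# Sketch: zero-shot classification by description only (no class names).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Invented, name-free part descriptions (no class names anywhere).
class_descriptions = [
    ["a crest of red feathers on the head", "a short, thick orange beak"],
    ["blue and white barred wing feathers", "a black collar around the neck"],
]

def classify(images):
    """Return the index of the class whose descriptions best match the image."""
    with torch.no_grad():
        img = model.encode_image(images)
        img = img / img.norm(dim=-1, keepdim=True)
        scores = []
        for descs in class_descriptions:
            txt = model.encode_text(clip.tokenize(descs).to(device))
            txt = txt / txt.norm(dim=-1, keepdim=True)
            scores.append((img @ txt.T).mean(dim=-1))  # average over descriptions
        return torch.stack(scores, dim=-1).argmax(dim=-1)
```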
arXiv Detail & Related papers (2024-12-18T15:28:08Z)
- Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning [83.10178754323955]
The Hybrid Discriminative Attribute-Object Embedding (HDA-OE) network is proposed to address the complex interactions between attribute and object visual representations.
To increase the variability of training data, HDA-OE introduces an attribute-driven data synthesis (ADDS) module.
To further improve the discriminative ability of the model, HDA-OE introduces the subclass-driven discriminative embedding (SDDE) module.
The proposed model has been evaluated on three benchmark datasets, and the results verify its effectiveness and reliability.
arXiv Detail & Related papers (2024-11-28T09:50:25Z)
- Verbalized Representation Learning for Interpretable Few-Shot Generalization [130.8173035901391]
Verbalized Representation Learning (VRL) is a novel approach for automatically extracting human-interpretable features for object recognition.
Our method captures inter-class differences and intra-class commonalities in the form of natural language.
VRL achieves a 24% absolute improvement over prior state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T01:55:08Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) still struggle to classify fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes as they interact with different objects, e.g., the attribute "wet" in "wet apple" and in "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
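A minimal PyTorch sketch of the idea of conditional attribute embeddings: the attribute vector is computed from the base attribute, the recognized object, and the image feature, so "wet" can embed differently for "wet apple" and "wet cat". Layer sizes and the fusion scheme are illustrative assumptions, not the paper's architecture.

```python
# Sketch: an attribute embedding conditioned on object identity and image feature.
import torch
import torch.nn as nn

class ConditionalAttribute(nn.Module):
    def __init__(self, num_attrs, num_objs, dim=256):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attrs, dim)   # base (unconditional) attribute
        self.obj_emb = nn.Embedding(num_objs, dim)
        self.fuse = nn.Sequential(                     # condition on object + image
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, attr_id, obj_id, img_feat):
        a, o = self.attr_emb(attr_id), self.obj_emb(obj_id)
        return self.fuse(torch.cat([a, o, img_feat], dim=-1))

# The embedding of attribute 0 ("wet", say) differs across conditioning objects:
m = ConditionalAttribute(num_attrs=10, num_objs=5)
img = torch.randn(1, 256)
wet_apple = m(torch.tensor([0]), torch.tensor([1]), img)
wet_cat   = m(torch.tensor([0]), torch.tensor([2]), img)
```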
arXiv Detail & Related papers (2023-05-29T08:04:05Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Boosting Generative Zero-Shot Learning by Synthesizing Diverse Features with Attribute Augmentation [21.72622601533585]
We propose a novel framework to boost Zero-Shot Learning (ZSL) by synthesizing diverse features.
This method uses augmented semantic attributes to train the generative model, so as to simulate the real distribution of visual features.
We evaluate the proposed model on four benchmark datasets, observing significant performance improvement against the state-of-the-art.
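A sketch of the attribute-augmentation idea: class-level semantic attributes are perturbed before conditioning a feature generator, so synthesized visual features cover a wider, more realistic distribution. The noise model and generator below are illustrative assumptions, not the paper's exact design.

```python
# Sketch: augmenting semantic attributes before conditional feature synthesis.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, attr_dim=312, noise_dim=64, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + noise_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim))

    def forward(self, attr, z):
        return self.net(torch.cat([attr, z], dim=-1))

def augment_attributes(attr, sigma=0.1):
    """Jitter class-level semantic attributes to diversify conditioning signals."""
    return attr + sigma * torch.randn_like(attr)

gen = FeatureGenerator()
class_attr = torch.rand(8, 312)                 # e.g. CUB-style attribute vectors
fake_feats = gen(augment_attributes(class_attr),
                 torch.randn(8, 64))            # synthesized unseen-class features
```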
arXiv Detail & Related papers (2021-12-23T14:32:51Z)
- Shaping Visual Representations with Attributes for Few-Shot Learning [5.861206243996454]
Few-shot recognition aims to recognize novel categories under low-data regimes.
Recent metric-learning based few-shot learning methods have achieved promising performance.
We propose attribute-shaped learning (ASL), which can normalize visual representations to predict attributes for query images.
arXiv Detail & Related papers (2021-12-13T03:16:19Z)
- FashionSearchNet-v2: Learning Attribute Representations with Localization for Image Retrieval with Attribute Manipulation [22.691709684780292]
The proposed FashionSearchNet-v2 architecture is able to learn attribute-specific representations by leveraging its weakly-supervised localization module.
The network is jointly trained with the combination of attribute classification and triplet ranking loss to estimate local representations.
Experiments performed on several datasets that are rich in terms of the number of attributes show that FashionSearchNet-v2 outperforms the other state-of-the-art attribute manipulation techniques.
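A minimal sketch of the joint objective named above: an attribute classification loss combined with a triplet ranking loss over local representations. The margin, weighting, and the localization module itself are assumptions for illustration.

```python
# Sketch: joint attribute-classification + triplet-ranking training objective.
import torch
import torch.nn as nn

attr_ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.2)

def joint_loss(attr_logits, attr_labels, anchor, positive, negative, w=1.0):
    """Combine attribute classification with triplet ranking on local features."""
    return attr_ce(attr_logits, attr_labels) + w * triplet(anchor, positive, negative)

# Usage with dummy tensors:
logits = torch.randn(4, 10); labels = torch.randint(0, 10, (4,))
a, p, n = (torch.randn(4, 128) for _ in range(3))
loss = joint_loss(logits, labels, a, p, n)
```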
arXiv Detail & Related papers (2021-11-28T13:50:20Z)
- Learning Compositional Representation for Few-shot Visual Question Answering [93.4061107793983]
Current methods of Visual Question Answering perform well on answers with ample training data but have limited accuracy on novel answers with few examples.
We propose to extract attributes from answers with sufficient data, which are later composed to constrain the learning of the few-shot answers.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
arXiv Detail & Related papers (2021-02-21T10:16:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.