Learning Concise and Descriptive Attributes for Visual Recognition
- URL: http://arxiv.org/abs/2308.03685v1
- Date: Mon, 7 Aug 2023 16:00:22 GMT
- Title: Learning Concise and Descriptive Attributes for Visual Recognition
- Authors: An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William
Wang, Jingbo Shang, Julian McAuley
- Abstract summary: Prior work shows that querying thousands of attributes can achieve performance competitive with image features.
We propose a novel learning-to-search method to discover those concise sets of attributes.
- Score: 25.142065847381758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in foundation models present new opportunities for
interpretable visual recognition -- one can first query Large Language Models
(LLMs) to obtain a set of attributes that describe each class, then apply
vision-language models to classify images via these attributes. Pioneering work
shows that querying thousands of attributes can achieve performance competitive
with image features. However, our further investigation on 8 datasets reveals
that LLM-generated attributes in a large quantity perform almost the same as
random words. This surprising finding suggests that significant noise may be
present in these attributes. We hypothesize that there exist subsets of
attributes that can maintain the classification performance with much smaller
sizes, and propose a novel learning-to-search method to discover those concise
sets of attributes. As a result, on the CUB dataset, our method achieves
performance close to that of massive LLM-generated attributes (e.g., 10k
attributes for CUB), yet using only 32 attributes in total to distinguish 200
bird species. Furthermore, our new paradigm demonstrates several additional
benefits: higher interpretability and interactivity for humans, and the ability
to summarize knowledge for a recognition task.
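The paradigm described above amounts to: embed a set of natural-language attributes with a vision-language model, score each image against those attributes, and classify from the resulting attribute scores. Below is a minimal sketch of that pipeline, assuming an open_clip ViT-B-32 backbone; the example attribute strings and the AttributeClassifier head are illustrative placeholders, not the authors' released implementation or their learning-to-search procedure.

```python
# Minimal sketch of attribute-based recognition with a CLIP-style model.
# Illustrative only: attribute list, model choice, and classifier head are
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Hypothetical concise attribute vocabulary (the paper learns ~32 attributes
# for CUB's 200 bird species; these three strings are made-up examples).
attributes = ["a bright red crown", "a long hooked beak", "webbed feet"]

with torch.no_grad():
    attr_feats = model.encode_text(tokenizer(attributes))
    attr_feats = attr_feats / attr_feats.norm(dim=-1, keepdim=True)


class AttributeClassifier(nn.Module):
    """Maps image-to-attribute similarity scores to class logits, so each
    prediction can be read off the handful of attributes that fired."""

    def __init__(self, attr_feats: torch.Tensor, num_classes: int):
        super().__init__()
        self.register_buffer("attr_feats", attr_feats)            # (A, D)
        self.head = nn.Linear(attr_feats.shape[0], num_classes)   # scores -> logits

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        attr_scores = image_feats @ self.attr_feats.T             # (B, A)
        return self.head(attr_scores)                             # (B, num_classes)


clf = AttributeClassifier(attr_feats, num_classes=200)
```

Image features from model.encode_image(preprocess(image).unsqueeze(0)) can then be passed through clf. The paper's learning-to-search contribution is, in addition, to select which small attribute subset to keep (e.g., 32 attributes for CUB's 200 classes) instead of using thousands of LLM-generated candidates; that selection step is not sketched here.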
Related papers
- Evolving Interpretable Visual Classifiers with Large Language Models [34.4903887876357]
Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance.
However, vision-language models, which compute similarity scores between images and class labels, are largely black boxes, with limited interpretability, a risk of bias, and an inability to discover new visual concepts that are not written down.
We present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition.
arXiv Detail & Related papers (2024-04-15T17:09:53Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes as they interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Boosting Generative Zero-Shot Learning by Synthesizing Diverse Features with Attribute Augmentation [21.72622601533585]
We propose a novel framework to boost Zero-Shot Learning (ZSL) by synthesizing diverse features.
This method uses augmented semantic attributes to train the generative model, so as to simulate the real distribution of visual features.
We evaluate the proposed model on four benchmark datasets, observing significant performance improvement against the state-of-the-art.
arXiv Detail & Related papers (2021-12-23T14:32:51Z)
- Shaping Visual Representations with Attributes for Few-Shot Learning [5.861206243996454]
Few-shot recognition aims to recognize novel categories under low-data regimes.
Recent metric-learning based few-shot learning methods have achieved promising performance.
We propose attribute-shaped learning (ASL), which can normalize visual representations to predict attributes for query images.
arXiv Detail & Related papers (2021-12-13T03:16:19Z)
- Make an Omelette with Breaking Eggs: Zero-Shot Learning for Novel Attribute Synthesis [65.74825840440504]
We propose Zero Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge.
Our proposed method is able to synthesize the detectors of novel attributes in a zero-shot learning manner.
Using only 32 seen attributes on the Caltech-UCSD Birds-200-2011 dataset, our proposed method is able to synthesize 207 additional novel attributes.
arXiv Detail & Related papers (2021-11-28T15:45:54Z)
- FashionSearchNet-v2: Learning Attribute Representations with Localization for Image Retrieval with Attribute Manipulation [22.691709684780292]
The proposed FashionSearchNet-v2 architecture is able to learn attribute-specific representations by leveraging its weakly-supervised localization module.
The network is jointly trained with the combination of attribute classification and triplet ranking loss to estimate local representations.
Experiments on several datasets rich in attributes show that FashionSearchNet-v2 outperforms other state-of-the-art attribute manipulation techniques.
arXiv Detail & Related papers (2021-11-28T13:50:20Z)
- Learning Compositional Representation for Few-shot Visual Question Answering [93.4061107793983]
Current Visual Question Answering methods perform well on answers with ample training data but have limited accuracy on novel answers with only a few examples.
We propose to extract attributes from answers that have sufficient data, and later compose them to constrain the learning of the few-shot answers.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
arXiv Detail & Related papers (2021-02-21T10:16:24Z)
- Attributes-Guided and Pure-Visual Attention Alignment for Few-Shot Recognition [27.0842107128122]
We devise an attributes-guided attention module (AGAM) to utilize human-annotated attributes and learn more discriminative features.
Our proposed module can significantly improve simple metric-based approaches to achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-09-10T08:38:32Z)
- Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification [91.67977602992657]
We propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches.
We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training.
arXiv Detail & Related papers (2020-03-20T15:44:17Z)