Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
- URL: http://arxiv.org/abs/2305.14428v3
- Date: Wed, 10 Jul 2024 15:54:11 GMT
- Title: Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
- Authors: Wentao Bao, Lichang Chen, Heng Huang, Yu Kong
- Abstract summary: The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose PLID, a model that prompts language-informed distributions for this task.
Experimental results on the MIT-States, UT-Zappos, and C-GQA datasets show that PLID outperforms prior art.
- Score: 73.49852821602057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts, e.g., sliced tomatoes, where the model is trained only on seen compositions, e.g., sliced potatoes and red tomatoes. Thanks to prompt tuning on large pre-trained vision-language models such as CLIP, recent literature shows impressively better CZSL performance than traditional vision-based methods. However, the key factors that impact generalization to unseen compositions, including the diversity and informativeness of class context and the entanglement between the visual primitives, i.e., state and object, are not properly addressed in existing CLIP-based CZSL literature. In this paper, we propose PLID, a model that prompts language-informed distributions for the CZSL task. Specifically, PLID leverages pre-trained large language models (LLMs) to (i) formulate diverse and informative language-informed class distributions and (ii) enhance the compositionality of the class embeddings. Moreover, a visual-language primitive decomposition (VLPD) module is proposed to dynamically fuse the classification decisions from the compositional and the primitive spaces. Orthogonal to the existing literature on soft, hard, or distributional prompts, our method advocates prompting LLM-supported class distributions, leading to better zero-shot generalization. Experimental results on the MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of PLID over prior art. Our code and models are released at https://github.com/Cogito2012/PLID.
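The abstract describes two decision spaces, compositional (state-object pairs) and primitive (states and objects separately), whose scores are fused. The snippet below is only a minimal conceptual sketch of that score-fusion idea using cosine similarities; it is not the released PLID implementation, which per the abstract models class text as LLM-informed distributions and fuses the decisions dynamically rather than with the fixed weights assumed here. All function and variable names are illustrative, and random vectors stand in for CLIP/LLM embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse_czsl_scores(img_feat, comp_text_feats, state_text_feats, obj_text_feats,
                     pairs, w=(0.5, 0.5)):
    """Toy fusion of compositional and primitive classification scores.

    img_feat:         (d,) image embedding
    comp_text_feats:  (num_pairs, d) embeddings of composed class prompts, e.g.
                      averaged over several LLM-generated descriptions per class
    state_text_feats: (num_states, d); obj_text_feats: (num_objects, d)
    pairs:            list of (state_idx, obj_idx) for each composition class
    """
    img = l2_normalize(img_feat)
    comp_logits = l2_normalize(comp_text_feats) @ img    # compositional space
    state_logits = l2_normalize(state_text_feats) @ img  # primitive spaces
    obj_logits = l2_normalize(obj_text_feats) @ img
    # recompose primitive scores into the pair space, then fuse with fixed weights
    prim_logits = np.array([state_logits[s] + obj_logits[o] for s, o in pairs])
    return w[0] * comp_logits + w[1] * prim_logits

# usage with random features standing in for real embeddings
rng = np.random.default_rng(0)
d, pairs = 8, [(0, 0), (0, 1), (1, 0), (1, 1)]
scores = fuse_czsl_scores(rng.normal(size=d), rng.normal(size=(4, d)),
                          rng.normal(size=(2, d)), rng.normal(size=(2, d)), pairs)
print(scores.argmax())  # index of the predicted (state, object) composition
```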
Related papers
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability.
The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts.
As the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z)
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification [12.053713356249695]
We propose LLaMP, Large Language Models as Prompt learners, which produces adaptive prompts for the CLIP text encoder.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification.
arXiv Detail & Related papers (2023-12-07T06:43:34Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
- CAILA: Concept-Aware Intra-Layer Adapters for Compositional Zero-Shot Learning [14.496173899477283]
We study the problem of Compositional Zero-Shot Learning (CZSL), which is to recognize novel attribute-object combinations with pre-existing concepts.
We propose to insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer.
We further equip adapters with concept awareness so that concept-specific features of "object", "attribute", and "composition" can be extracted; a generic adapter sketch follows this entry.
arXiv Detail & Related papers (2023-05-26T07:02:57Z)
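The CAILA entry above centers on inserting parameter-efficient adapters into CLIP encoder layers. Below is a generic bottleneck-adapter sketch in PyTorch (down-project, nonlinearity, up-project, residual), shown only to illustrate the adapter pattern; CAILA's concept-aware design and exact placement are not reproduced here, and the module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic adapter: a small residual bottleneck added to a frozen layer's output."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# usage: attach a trainable adapter to a frozen encoder block
frozen_block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in frozen_block.parameters():
    p.requires_grad = False
adapter = BottleneckAdapter(dim=512)

tokens = torch.randn(2, 10, 512)     # (batch, sequence, dim)
out = adapter(frozen_block(tokens))  # only the adapter would receive gradients
print(out.shape)                     # torch.Size([2, 10, 512])
```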
- Mutual Balancing in State-Object Components for Compositional Zero-Shot Learning [0.0]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions from seen states and objects.
We propose a novel method called MUtual balancing in STate-object components (MUST) for CZSL, which provides a balancing inductive bias for the model.
Our approach significantly outperforms the state-of-the-art on MIT-States, UT-Zappos, and C-GQA when combined with the basic CZSL frameworks.
arXiv Detail & Related papers (2022-11-19T10:21:22Z)
- SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model; a minimal sketch of this frozen-backbone pattern follows the entry.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
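The SUPERB entry above describes training small task-specific heads on top of a frozen, shared self-supervised model. The sketch below shows that pattern in its simplest form; a toy MLP stands in for a real pretrained speech encoder, and all names and hyperparameters are illustrative, not the SUPERB framework itself.

```python
import torch
import torch.nn as nn

# stand-in for a pretrained, frozen SSL backbone (e.g., a speech encoder)
backbone = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False

# lightweight, task-specific prediction head: the only trainable part
num_classes = 10
head = nn.Linear(256, num_classes)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(32, 80)                 # toy input features
labels = torch.randint(0, num_classes, (32,))  # toy labels
with torch.no_grad():                          # backbone stays frozen
    reps = backbone(features)
loss = loss_fn(head(reps), labels)
loss.backward()                                # gradients flow only into the head
optimizer.step()
print(float(loss))
```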
- SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this discourse-level representation improves the performance of the original BERT by large margins; a toy illustration of the unshuffling setup follows this entry.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
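The SLM entry above names sentence unshuffling as the pre-training signal. The snippet below only illustrates how such training pairs could be constructed (shuffle a document's sentences and keep the original order as the target); it is not the SLM model or its actual objective, and the helper name is made up for illustration.

```python
import random

def make_unshuffling_example(sentences, seed=None):
    """Build one toy sentence-unshuffling example: shuffled input plus target order."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    # target[j] is the original index of the sentence now at position j
    target = order
    return shuffled, target

doc = ["The meeting ran long.", "Everyone was tired.", "We postponed the decision."]
shuffled, target = make_unshuffling_example(doc, seed=7)
print(shuffled)  # sentences in a permuted order (model input)
print(target)    # original indices to recover (supervision signal)
```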
- Leveraging Seen and Unseen Semantic Relationships for Generative Zero-Shot Learning [14.277015352910674]
We propose a generative model that explicitly performs knowledge transfer by incorporating a novel Semantic Regularized Loss (SR-Loss).
Experiments on seven benchmark datasets demonstrate the superiority of the LsrGAN compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-19T01:25:53Z)