Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment
- URL: http://arxiv.org/abs/2308.12960v3
- Date: Sun, 24 Dec 2023 16:43:52 GMT
- Title: Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment
- Authors: Sheng Zhang, Muzammal Naseer, Guangyi Chen, Zhiqiang Shen, Salman
Khan, Kun Zhang, Fahad Khan
- Abstract summary: Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
- Score: 53.2701026843921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pre-trained Vision Language Models (VLMs) have proven effective
for zero-shot classification. Despite the success, most traditional VLMs-based
methods are restricted by the assumption of partial source supervision or ideal
vocabularies, which rarely satisfy the open-world scenario. In this paper, we
aim at a more challenging setting, Realistic Zero-Shot Classification, which
assumes no annotation but instead a broad vocabulary. To address this
challenge, we propose the Self Structural Semantic Alignment (S^3A) framework,
which extracts the structural semantic information from unlabeled data while
simultaneously self-learning. Our S^3A framework adopts a unique
Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups
unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR
process includes iterative clustering on images, voting within each cluster to
identify initial class candidates from the vocabulary, generating
discriminative prompts with large language models to discern confusing
candidates, and realigning images and the vocabulary as structural semantic
alignment. Finally, we propose to self-learn the CLIP image encoder with both
individual and structural semantic alignment through a teacher-student learning
strategy. Our comprehensive experiments across various generic and fine-grained
benchmarks demonstrate that the S^3A method offers substantial improvements
over existing VLMs-based approaches, achieving a more than 15% accuracy
improvement over CLIP on average. Our codes, models, and prompts are publicly
released at https://github.com/sheng-eatamath/S3A.
Related papers
- Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision.
We introduce a semantic decoupling module and a category-specific prompt optimization method in CLIP-based framework.
Our method effectively separates information from different categories and achieves better performance compared to CLIP-based baseline method.
arXiv Detail & Related papers (2024-12-14T14:31:36Z) - $S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models [41.244610382963764]
This paper proposes a textbfSynonymous textbfSemantic textbfSpace ($S3$) for each image class, achieving more stable semantic alignment and improving the zero-shot generalization of CLIP.
Experiments are conducted across 17 benchmarks, including fine-grained zero-shot classification, natural distribution zero-shot classification, and open-vocabulary segmentation.
arXiv Detail & Related papers (2024-12-06T10:26:51Z) - Training-Free Semantic Segmentation via LLM-Supervision [37.9007813884699]
This paper introduces a new approach to text-supervised semantic segmentation using supervision by a large language model (LLM)
Our method starts from an LLM to generate a detailed set of subclasses for more accurate class representation.
We then employ an advanced text-supervised semantic segmentation model to apply the generated subclasses as target labels.
arXiv Detail & Related papers (2024-03-31T14:37:25Z) - Prompting Language-Informed Distribution for Compositional Zero-Shot Learning [73.49852821602057]
Compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose a model by prompting the language-informed distribution, aka., PLID, for the task.
Experimental results on MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of the PLID to the prior arts.
arXiv Detail & Related papers (2023-05-23T18:00:22Z) - Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z) - M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [58.617025733655005]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning)
It introduces open words from the WordNet to extend the range of words forming the prompt texts from only closed-set label words to more, and thus prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z) - Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z) - OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal
Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z) - CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action
Recognition [52.66360172784038]
We propose a clustering-based model, which considers all training samples at once, instead of optimizing for each instance individually.
We call the proposed method CLASTER and observe that it consistently improves over the state-of-the-art in all standard datasets.
arXiv Detail & Related papers (2021-01-18T12:46:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.