Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation
- URL: http://arxiv.org/abs/2510.09867v1
- Date: Fri, 10 Oct 2025 20:58:43 GMT
- Title: Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation
- Authors: Zhi Chen, Xin Yu, Xiaohui Tao, Yan Li, Zi Huang
- Abstract summary: Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. We propose the Cluster-Aware Prompt Ensemble Learning framework, which preserves the cluster nature of context prompts.
- Score: 40.60703048681749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
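To make the logit-space idea concrete, below is a minimal PyTorch sketch contrasting conventional feature averaging with per-prompt logit ensembling, plus one plausible cluster-preserving regularizer. All tensor names, shapes, the max/weighted combination rule, and the regularizer are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Sketch: feature-space averaging vs. logit-space prompt ensembling.
# Assumes precomputed, L2-normalized CLIP-style embeddings.
import torch
import torch.nn.functional as F

def feature_average_logits(image_feats, text_feats):
    """Conventional ensembling: average each class's prompt features,
    then classify. text_feats: (C, P, D) = classes x prompts x dim."""
    centroids = F.normalize(text_feats.mean(dim=1), dim=-1)  # (C, D)
    return image_feats @ centroids.t()                       # (N, C)

def cluster_aware_logits(image_feats, text_feats, weights=None):
    """Logit-space ensembling: score every prompt separately, then
    combine per-class logits, so each prompt keeps its own cluster."""
    per_prompt = torch.einsum("nd,cpd->ncp", image_feats, text_feats)  # (N, C, P)
    if weights is None:
        return per_prompt.max(dim=-1).values               # hard cluster choice
    return (per_prompt * weights.softmax(-1)).sum(-1)      # adaptive weighting

def cluster_preserving_reg(text_feats):
    """One plausible regularizer (an assumption, not the paper's exact
    term): penalize cosine similarity between a class's prompts so they
    stay distinct instead of collapsing into one direction."""
    sim = torch.einsum("cpd,cqd->cpq", text_feats, text_feats)          # (C, P, P)
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    return off_diag.abs().mean()

# toy usage
N, C, P, D = 4, 10, 3, 512
img = F.normalize(torch.randn(N, D), dim=-1)
txt = F.normalize(torch.randn(C, P, D), dim=-1)
print(cluster_aware_logits(img, txt).shape)  # torch.Size([4, 10])
print(cluster_preserving_reg(txt).item())
```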
Related papers
- ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering [79.69917150582633]
Multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering. Our method first discovers that MLLMs' hidden states of text tokens are strongly related to the corresponding features. We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy.
arXiv Detail & Related papers (2025-11-30T04:36:51Z) - In-Context Clustering with Large Language Models [50.25868718329313]
ICC captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering.
arXiv Detail & Related papers (2025-10-09T17:07:55Z) - Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift [14.85375816073596]
We introduce multiple learnable prompts per class to capture diverse modes in visual representations arising from distribution shifts. These prompts are modeled as learnable Gaussian distributions, enabling efficient exploration of the prompt parameter space (a toy reparameterization sketch appears after this listing). Experiments and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2025-06-04T13:18:04Z) - Cluster Specific Representation Learning [1.6727186769396276]
Despite its widespread application, there is no established definition of a "good" representation. We propose a downstream-agnostic formulation: when inherent clusters exist in the data, the representations should be specific to each cluster. Under this idea, we develop a meta-algorithm that jointly learns cluster-specific representations and cluster assignments.
arXiv Detail & Related papers (2024-12-04T16:59:37Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features for classification (a toy sketch of this recipe appears after the listing below).
arXiv Detail & Related papers (2023-07-10T03:06:45Z) - You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data assigned to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z) - Unsupervised Visual Representation Learning by Online Constrained K-Means [44.38989920488318]
Cluster discrimination is an effective pretext task for unsupervised representation learning.
We propose a novel clustering-based pretext task with online Constrained K-means (CoKe). Our online assignment method has a theoretical guarantee to approach the global optimum (a toy constrained-assignment sketch appears after this listing).
arXiv Detail & Related papers (2021-05-24T20:38:32Z)
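On the stochastic prompt-tuning entry above: the summary describes each class keeping several prompts modeled as learnable Gaussian distributions. A toy sketch of that idea, sampling context vectors via the reparameterization trick, follows; class/prompt counts, shapes, and names are assumptions for illustration only.

```python
# Sketch: multiple learnable Gaussian prompt distributions per class.
import torch

class StochasticPrompts(torch.nn.Module):
    def __init__(self, n_classes, k_prompts, ctx_len, dim):
        super().__init__()
        shape = (n_classes, k_prompts, ctx_len, dim)
        self.mu = torch.nn.Parameter(torch.zeros(shape))           # prompt means
        self.log_sigma = torch.nn.Parameter(torch.full(shape, -3.0))  # log std devs

    def sample(self):
        # reparameterization: mu + sigma * eps keeps sampling differentiable,
        # so the distribution parameters can be tuned by gradient descent
        eps = torch.randn_like(self.mu)
        return self.mu + self.log_sigma.exp() * eps  # (C, K, L, D) context tokens

prompts = StochasticPrompts(n_classes=10, k_prompts=4, ctx_len=16, dim=512)
print(prompts.sample().shape)  # torch.Size([10, 4, 16, 512])
```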
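On the SLR-AVD entry above: the summary describes representing each image by its similarity to LLM-generated class descriptions, then fitting a sparse logistic regression to select a relevant subset. The hedged sketch below uses random embeddings as stand-ins for the description and image features; the L1 setup is one standard way to realize "sparse logistic regression", not the paper's exact configuration.

```python
# Sketch: description-similarity features + L1 logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_images, n_descriptions, dim = 200, 50, 512

img_emb = rng.normal(size=(n_images, dim))          # stand-in VLM image features
desc_emb = rng.normal(size=(n_descriptions, dim))   # stand-in encoded descriptions
labels = rng.integers(0, 5, size=n_images)

# each image is represented by its similarity to every visual description
features = img_emb @ desc_emb.T                     # (n_images, n_descriptions)

# the L1 penalty drives most description weights to zero => feature selection
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000)
clf.fit(features, labels)
print("descriptions kept:", int((np.abs(clf.coef_) > 1e-6).any(axis=0).sum()))
```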
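On the CoKe entry above: constrained k-means caps (or balances) cluster sizes so that online cluster discrimination cannot collapse onto a few clusters. The greedy capacity-capped assignment below is an illustrative assumption in that spirit, not the paper's actual online algorithm or its optimality guarantee.

```python
# Sketch: k-means assignment with a per-cluster capacity constraint.
import numpy as np

def constrained_assign(points, centers, cap):
    n, k = len(points), len(centers)
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
    order = np.argsort(dists.min(axis=1))   # most confident points assign first
    counts = np.zeros(k, dtype=int)
    assign = np.full(n, -1)
    for i in order:
        for c in np.argsort(dists[i]):      # nearest cluster with room left
            if counts[c] < cap:
                assign[i] = c
                counts[c] += 1
                break
    return assign

rng = np.random.default_rng(0)
pts = rng.normal(size=(12, 2))
ctr = rng.normal(size=(3, 2))
print(constrained_assign(pts, ctr, cap=4))  # every cluster ends with exactly 4 points
```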
This list is automatically generated from the titles and abstracts of the papers in this site.