Related papers: Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model

Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model

URL: http://arxiv.org/abs/2407.04489v1
Date: Fri, 5 Jul 2024 13:15:29 GMT
Title: Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model
Authors: Duy M. H. Nguyen, An T. Le, Trung Q. Nguyen, Nghiem T. Diep, Tai Nguyen, Duy Duong-Tran, Jan Peters, Li Shen, Mathias Niepert, Daniel Sonntag,
Abstract summary: We introduce a new framework based on a dual context of both domain-shared and class-specific contexts. Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in Large Language Models. We also formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens.
Score: 27.56988000960972
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we consider a new framework based on a dual context of both domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in LLM knowledge. Moreover, we formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens. Through partial matching, UOT can properly align discrete sets of visual tokens and prompt embeddings under different mass distributions, which is particularly valuable for handling irrelevant or noisy elements, ensuring that the preservation of mass does not restrict transport solutions. Furthermore, UOT's characteristics integrate seamlessly with image augmentation, expanding the training sample pool while maintaining a reasonable distance between perturbed images and prompt inputs. Extensive experiments across few-shot classification and adapter settings substantiate the superiority of our model over current state-of-the-art baselines.

Related papers

MultiModal Fine-tuning with Synthetic Captions [9.572235167281686]
We propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs)<n>Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions for classification tasks.<n>Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning.
arXiv Detail & Related papers (2026-01-29T09:03:45Z)
Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model [45.52442142624057]
Few-Shot Adaptation (FSA-VLM) has emerged as a key solution to adapt models with minimal data.<n>Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using.<n>Efficient Fine-Tuning (PEFT) to adapt models with minimal data.<n>ADK is a novel framework that efficiently enriches text representations without compromising efficiency.
arXiv Detail & Related papers (2025-12-19T07:52:25Z)
DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image.<n> Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision.<n>We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z)
Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models [21.20658517302458]
MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) is a novel paradigm designed for conditional prompt generation.<n> AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks.<n>MPF mechanism integrates SCP andVCP with contextual prompts, ensuring seamless coordination.
arXiv Detail & Related papers (2025-07-11T08:45:27Z)
Discriminative Fine-tuning of LVLMs [67.14293827774827]
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. We propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs.
arXiv Detail & Related papers (2024-12-05T17:54:27Z)
Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a simple yet effective Multi-modal Spatio-supervised (MSTA) to improve the alignment between representations in the text and vision branches. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-valiant, and fully-language learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z)
OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors. This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training. Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++)
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC. PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z)
MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations. We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET. We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images. We find that DUET can often achieve state-of-the-art performance, its components are effective and its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.