Delving into Multimodal Prompting for Fine-grained Visual Classification
- URL: http://arxiv.org/abs/2309.08912v2
- Date: Wed, 13 Dec 2023 05:24:46 GMT
- Title: Delving into Multimodal Prompting for Fine-grained Visual Classification
- Authors: Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li
- Abstract summary: Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category.
Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks.
We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model.
- Score: 57.12570556836394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained visual classification (FGVC) involves categorizing fine
subdivisions within a broader category, which poses challenges due to subtle
inter-class discrepancies and large intra-class variations. However, prevailing
approaches primarily focus on uni-modal visual concepts. Recent advancements in
pre-trained vision-language models have demonstrated remarkable performance in
various high-level vision tasks, yet the applicability of such models to FGVC
tasks remains uncertain. In this paper, we aim to fully exploit the
capabilities of cross-modal description to tackle FGVC tasks and propose a
novel multimodal prompting solution, denoted as MP-FGVC, based on the
contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a
multimodal prompt scheme and a multimodal adaptation scheme. The former
includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text
Prompt (DaTP), which explicitly highlight the subcategory-specific
discrepancies from the perspectives of both vision and language. The latter
aligns the vision and text prompting elements in a common semantic space,
facilitating cross-modal collaborative reasoning through a Vision-Language
Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a
two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained
CLIP model and expedite efficient adaptation for FGVC. Extensive experiments
conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
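The abstract names the building blocks of MP-FGVC (SsVP, DaTP, VLFM) without giving their internals, so the following PyTorch sketch is only a structural illustration of how learnable prompts on both modalities and a small fusion module could sit on top of a frozen CLIP backbone. The class name MPFGVCSketch, the prompt shapes, the cross-attention fusion, and the temperature are assumptions for illustration, not the authors' implementation.

```python
# Minimal structural sketch of the MP-FGVC idea: a frozen CLIP-style image
# encoder plus learnable vision prompts (SsVP), per-class text prompt offsets
# (DaTP), and a cross-attention fusion module (VLFM). All internals here are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPFGVCSketch(nn.Module):
    def __init__(self, clip_image_encoder, num_classes, embed_dim=512, num_vision_prompts=4):
        super().__init__()
        # Frozen CLIP image encoder mapping images to [B, embed_dim] features.
        self.image_encoder = clip_image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False

        # Subcategory-specific Vision Prompt (SsVP): learnable tokens intended
        # to emphasize subcategory-level visual discrepancies.
        self.vision_prompts = nn.Parameter(0.02 * torch.randn(num_vision_prompts, embed_dim))

        # Discrepancy-aware Text Prompt (DaTP): one learnable offset per class,
        # added to frozen CLIP text features of the class-name templates.
        self.text_prompts = nn.Parameter(0.02 * torch.randn(num_classes, embed_dim))

        # Vision-Language Fusion Module (VLFM): cross-attention that lets each
        # class's text-side token attend to the image feature and vision prompts.
        self.vlfm = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, images, class_text_features):
        # class_text_features: [num_classes, embed_dim], precomputed once with
        # the frozen CLIP text encoder.
        img_feat = self.image_encoder(images)                                  # [B, D]
        b = img_feat.size(0)
        vis_tokens = torch.cat(
            [img_feat.unsqueeze(1),
             self.vision_prompts.unsqueeze(0).expand(b, -1, -1)], dim=1)       # [B, 1+P, D]
        txt_tokens = class_text_features + self.text_prompts                   # [C, D]
        txt_tokens = txt_tokens.unsqueeze(0).expand(b, -1, -1)                 # [B, C, D]

        fused, _ = self.vlfm(txt_tokens, vis_tokens, vis_tokens)               # [B, C, D]
        # Classify by cosine similarity between the image feature and each
        # fused class representation (the 0.01 temperature is arbitrary here).
        logits = F.cosine_similarity(img_feat.unsqueeze(1), fused, dim=-1) / 0.01
        return logits
```

The abstract's two-stage optimization is not detailed there; in a sketch like this it would amount to scheduling which of the prompt and fusion parameters are trainable in each stage while the CLIP backbone stays frozen.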
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models [7.350203999073509]
Feature Guidance Attack (FGA) is a novel method that uses text representations to direct the perturbation of clean images.
Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings.
arXiv Detail & Related papers (2024-07-25T06:10:33Z)
- Context-Semantic Quality Awareness Network for Fine-Grained Visual Categorization [30.92656780805478]
We propose a weakly supervised Context-Semantic Quality Awareness Network (CSQA-Net) for fine-grained visual categorization (FGVC).
To model the spatial contextual relationship between rich part descriptors and global semantics, we develop a novel multi-part and multi-scale cross-attention (MPMSCA) module.
We also propose a generic multi-level semantic quality evaluation module (MLSQE) to progressively supervise and enhance hierarchical semantics from different levels of the backbone network.
arXiv Detail & Related papers (2024-03-15T13:40:44Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs); a generic sketch of this kind of contrastive alignment follows this list.
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
arXiv Detail & Related papers (2023-04-29T08:59:12Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Interpretable Attention Guided Network for Fine-grained Visual Classification [36.657203916383594]
Fine-grained visual classification (FGVC) is more challenging, yet more critical, than traditional classification tasks.
We propose an Interpretable Attention Guided Network (IAGN) for fine-grained visual classification.
arXiv Detail & Related papers (2021-03-08T12:27:51Z)
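Several of the entries above (for example DoCo and SgVA-CLIP) hinge on contrastive alignment between paired feature sets. The snippet below is a generic symmetric InfoNCE sketch of that idea under the assumption of row-wise paired features; the function name, shapes, and temperature are illustrative and do not reproduce any specific paper's loss.

```python
# Generic symmetric InfoNCE loss for aligning two sets of paired features.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(feats_a, feats_b, temperature=0.07):
    """Row i of feats_a and feats_b form a positive pair; other rows act as negatives."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                     # [N, N] cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives sit on the diagonal
    loss_a = F.cross_entropy(logits, targets)            # align a -> b
    loss_b = F.cross_entropy(logits.t(), targets)        # align b -> a
    return 0.5 * (loss_a + loss_b)


# Toy usage: align auxiliary (e.g. document-object) features to vision features.
aux_feats = torch.randn(8, 512)
vis_feats = torch.randn(8, 512)
print(contrastive_alignment_loss(aux_feats, vis_feats))
```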
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences of its use.