Delving into Multimodal Prompting for Fine-grained Visual Classification
- URL: http://arxiv.org/abs/2309.08912v2
- Date: Wed, 13 Dec 2023 05:24:46 GMT
- Title: Delving into Multimodal Prompting for Fine-grained Visual Classification
- Authors: Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li
- Abstract summary: Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category.
Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks.
We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model.
- Score: 57.12570556836394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained visual classification (FGVC) involves categorizing fine
subdivisions within a broader category, which poses challenges due to subtle
inter-class discrepancies and large intra-class variations. However, prevailing
approaches primarily focus on uni-modal visual concepts. Recent advancements in
pre-trained vision-language models have demonstrated remarkable performance in
various high-level vision tasks, yet the applicability of such models to FGVC
tasks remains uncertain. In this paper, we aim to fully exploit the
capabilities of cross-modal description to tackle FGVC tasks and propose a
novel multimodal prompting solution, denoted as MP-FGVC, based on the
contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a
multimodal prompts scheme and a multimodal adaptation scheme. The former
includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text
Prompt (DaTP), which explicitly highlight the subcategory-specific
discrepancies from the perspectives of both vision and language. The latter
aligns the vision and text prompting elements in a common semantic space,
facilitating cross-modal collaborative reasoning through a Vision-Language
Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a
two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained
CLIP model and expedite efficient adaptation for FGVC. Extensive experiments
conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
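The paper's implementation is not reproduced here; the following is only a minimal PyTorch sketch, under assumed module names and sizes, of how a CLIP-style multimodal prompting pipeline with learnable vision/text prompts and a fusion module could be wired together.
```python
# Minimal sketch (not the authors' code): a CLIP-style multimodal prompting
# pipeline in the spirit of MP-FGVC. Module names and sizes are assumptions.
import torch
import torch.nn as nn

class VisionLanguageFusion(nn.Module):
    """Cross-modal fusion: text-prompt tokens attend to vision-prompt tokens."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, vision_tokens):
        fused, _ = self.attn(text_tokens, vision_tokens, vision_tokens)
        return self.norm(fused + text_tokens)

class MPFGVCSketch(nn.Module):
    def __init__(self, num_classes, dim=512, n_vis_prompts=4, n_txt_prompts=4):
        super().__init__()
        # Learnable prompt tokens standing in for the vision/text prompts.
        self.vision_prompts = nn.Parameter(torch.randn(n_vis_prompts, dim) * 0.02)
        self.text_prompts = nn.Parameter(torch.randn(n_txt_prompts, dim) * 0.02)
        self.fusion = VisionLanguageFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (B, dim) global features from frozen CLIP encoders.
        B = img_feat.size(0)
        vis = torch.cat([img_feat.unsqueeze(1),
                         self.vision_prompts.expand(B, -1, -1)], dim=1)
        txt = torch.cat([txt_feat.unsqueeze(1),
                         self.text_prompts.expand(B, -1, -1)], dim=1)
        fused = self.fusion(txt, vis)          # cross-modal reasoning
        return self.classifier(fused.mean(dim=1))

# Usage with random stand-ins for frozen CLIP features:
model = MPFGVCSketch(num_classes=200)
logits = model(torch.randn(8, 512), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 200])
```
Presumably only the prompt, fusion, and classifier parameters would be trained while the CLIP encoders stay frozen, in line with the two-stage adaptation strategy described in the abstract.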
Related papers
- Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks [0.0]
Vision-Driven Prompt Optimization (VDPO) generates textual prompts from visual inputs, guiding high-fidelity image synthesis.
VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores.
Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.
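Purely as an illustration of vision-driven prompt selection (not VDPO's published algorithm), candidate prompts could be scored against image features with a CLIP-style similarity; the names and features below are placeholders.
```python
# Illustrative sketch only: score candidate text prompts against an image
# feature and keep the best match; stand-in features replace a real encoder.
import torch
import torch.nn.functional as F

def select_prompt(image_feat, prompt_feats, prompts):
    """image_feat: (dim,), prompt_feats: (K, dim) precomputed text features."""
    sims = F.cosine_similarity(image_feat.unsqueeze(0), prompt_feats, dim=-1)
    return prompts[int(sims.argmax())]

prompts = ["a close-up photo of a vintage red bicycle, soft morning light",
           "a photo of a bicycle"]
# Random stand-ins for frozen encoder outputs; the winning prompt would then
# condition the downstream image generator.
print(select_prompt(torch.randn(512), torch.randn(2, 512), prompts))
```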
arXiv Detail & Related papers (2025-01-05T13:01:47Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
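As an illustration only (the module name, shapes, and gating mechanism are assumptions, not the paper's code), an instruction-conditioned fusion of multi-layer visual features could look like this:
```python
# Minimal sketch: mix features from several encoder layers with weights
# predicted from an instruction embedding.
import torch
import torch.nn as nn

class InstructionGuidedAggregator(nn.Module):
    def __init__(self, dim=768, num_layers=4):
        super().__init__()
        # Predict one mixing weight per encoder layer from the instruction.
        self.gate = nn.Linear(dim, num_layers)

    def forward(self, layer_feats, instruction_emb):
        # layer_feats: (num_layers, B, N, dim) hidden states of the vision encoder
        # instruction_emb: (B, dim) pooled embedding of the textual instruction
        weights = torch.softmax(self.gate(instruction_emb), dim=-1)   # (B, L)
        weights = weights.t()[:, :, None, None]                       # (L, B, 1, 1)
        return (weights * layer_feats).sum(dim=0)                     # (B, N, dim)

agg = InstructionGuidedAggregator()
fused = agg(torch.randn(4, 2, 196, 768), torch.randn(2, 768))
print(fused.shape)  # torch.Size([2, 196, 768])
```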
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average of 0.77% accuracy on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- Context-Semantic Quality Awareness Network for Fine-Grained Visual Categorization [30.92656780805478]
We propose a weakly supervised Context-Semantic Quality Awareness Network (CSQA-Net) for fine-grained visual categorization (FGVC).
To model the spatial contextual relationship between rich part descriptors and global semantics, we develop a novel multi-part and multi-scale cross-attention (MPMSCA) module.
We also propose a generic multi-level semantic quality evaluation module (MLSQE) to progressively supervise and enhance hierarchical semantics from different levels of the backbone network.
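A minimal sketch of part-to-global cross-attention in this spirit (names and dimensions are assumed, and the multi-scale aspect is omitted):
```python
# Hedged sketch: part descriptors query global feature-map tokens via
# cross-attention; not the paper's MPMSCA module.
import torch
import torch.nn as nn

class PartToGlobalCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, part_tokens, global_tokens):
        # part_tokens:   (B, P, dim) descriptors of discriminative parts
        # global_tokens: (B, N, dim) global feature-map tokens (one scale)
        ctx, _ = self.attn(part_tokens, global_tokens, global_tokens)
        return self.norm(part_tokens + ctx)

layer = PartToGlobalCrossAttention()
out = layer(torch.randn(2, 8, 256), torch.randn(2, 49, 256))
print(out.shape)  # torch.Size([2, 8, 256])
```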
arXiv Detail & Related papers (2024-03-15T13:40:44Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
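A standard InfoNCE-style alignment loss of the kind such a framework could use (a sketch under assumptions, not DoCo's exact objective):
```python
# Sketch: symmetric contrastive loss aligning document-object features with
# the vision encoder's features for the same image.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(doc_feats, vis_feats, temperature=0.07):
    # doc_feats, vis_feats: (B, dim) paired features for the same document image
    doc = F.normalize(doc_feats, dim=-1)
    vis = F.normalize(vis_feats, dim=-1)
    logits = doc @ vis.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(doc.size(0), device=doc.device)
    # Each document-object feature should match its own visual feature
    # and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```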
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
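A toy illustration of the adaptive sampling idea mentioned above, keeping contour points where the boundary direction changes most (details are assumed; this is not necessarily the paper's exact scheme):
```python
# Illustrative sketch: subsample a closed contour so that high-curvature
# points (a proxy for "high gradient" regions of the mask boundary) are kept,
# turning a mask outline into a short point sequence.
import numpy as np

def adaptive_sample(contour: np.ndarray, num_points: int) -> np.ndarray:
    """contour: (N, 2) ordered boundary points; returns (num_points, 2)."""
    prev = np.roll(contour, 1, axis=0)
    nxt = np.roll(contour, -1, axis=0)
    v1, v2 = contour - prev, nxt - contour
    ang1 = np.arctan2(v1[:, 1], v1[:, 0])
    ang2 = np.arctan2(v2[:, 1], v2[:, 0])
    turn = np.abs(np.angle(np.exp(1j * (ang2 - ang1))))  # turning angle per vertex
    keep = np.argsort(-turn)[:num_points]                # keep the sharpest corners
    return contour[np.sort(keep)]                        # preserve traversal order

square = np.array([[x, 0] for x in range(10)]
                  + [[9, y] for y in range(1, 10)]
                  + [[x, 9] for x in range(8, -1, -1)]
                  + [[0, y] for y in range(8, 0, -1)], dtype=float)
print(adaptive_sample(square, 4))  # roughly the four corners
```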
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
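A rough sketch of how adapters and learnable prompt tokens can be combined around a frozen transformer block (an assumption-laden illustration, not APoLLo itself):
```python
# Sketch: prepend learnable prompt tokens and apply a residual bottleneck
# adapter around a frozen backbone block.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter applied after a frozen block."""
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class PromptedAdaptedBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, dim=512, n_prompts=4):
        super().__init__()
        self.block = frozen_block                      # frozen backbone block
        for p in self.block.parameters():
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.adapter = Adapter(dim)

    def forward(self, tokens):                         # tokens: (B, N, dim)
        B = tokens.size(0)
        tokens = torch.cat([self.prompts.expand(B, -1, -1), tokens], dim=1)
        return self.adapter(self.block(tokens))

block = PromptedAdaptedBlock(nn.TransformerEncoderLayer(512, 8, batch_first=True))
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 20, 512])
```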
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting instruction-tuning-style prompt design to a vision transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
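A hedged sketch of the general idea, prepending category-derived text prompt tokens to the patch tokens of a small transformer encoder (all names and sizes below are illustrative, not the paper's architecture):
```python
# Sketch: multi-modal prompting for a ViT-like encoder, where projected
# category text features are prepended to the patch tokens.
import torch
import torch.nn as nn

class MultimodalPromptedViT(nn.Module):
    def __init__(self, dim=384, depth=4, heads=6, num_classes=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.text_proj = nn.Linear(512, dim)    # maps category text features
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens, category_text_feats):
        # patch_tokens: (B, N, dim); category_text_feats: (B, K, 512)
        B = patch_tokens.size(0)
        prompts = self.text_proj(category_text_feats)                    # (B, K, dim)
        x = torch.cat([self.cls.expand(B, -1, -1), prompts, patch_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])                                        # classify from [CLS]

model = MultimodalPromptedViT()
logits = model(torch.randn(2, 196, 384), torch.randn(2, 5, 512))
print(logits.shape)  # torch.Size([2, 100])
```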
arXiv Detail & Related papers (2023-04-29T08:59:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.