Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks
- URL: http://arxiv.org/abs/2307.06795v1
- Date: Thu, 13 Jul 2023 15:05:34 GMT
- Title: Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks
- Authors: Denis Coquenet, Clément Rambour, Emanuele Dalsasso, Nicolas Thome
- Abstract summary: Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets.
We propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models.
- Score: 17.367599062853156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language foundation models such as CLIP have shown impressive
zero-shot performance on many tasks and datasets, especially thanks to their
free-text inputs. However, they struggle to handle some downstream tasks, such
as fine-grained attribute detection and localization. In this paper, we propose
a multitask fine-tuning strategy based on a positive/negative prompt
formulation to further leverage the capacities of the vision-language
foundation models. Using the CLIP architecture as baseline, we show strong
improvements on bird fine-grained attribute detection and localization tasks,
while also increasing the classification performance on the CUB200-2011
dataset. We provide source code for reproducibility purposes: it is available
at https://github.com/FactoDeepLearning/MultitaskVLFM.
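As a rough illustration of the positive/negative prompt idea, the sketch below scores a few bird attributes by comparing a CLIP image embedding against a "with {attribute}" / "without {attribute}" prompt pair. The prompt templates, the attribute names, the image path, and the use of the openai `clip` package are illustrative assumptions, not the authors' implementation; the actual multitask fine-tuning code is in the repository linked above.

```python
# Minimal sketch: binary attribute detection with CLIP via a positive/negative
# prompt pair per attribute. Templates and attributes are hypothetical; the
# real attribute vocabulary would come from the CUB-200-2011 annotations.
import torch
import clip  # openai CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

attributes = ["red wings", "a hooked beak", "a white belly"]  # hypothetical
image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    for attr in attributes:
        # One positive and one negative prompt per attribute.
        prompts = clip.tokenize([
            f"a photo of a bird with {attr}",
            f"a photo of a bird without {attr}",
        ]).to(device)
        text_feat = model.encode_text(prompts)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # Cosine similarities over the pair -> probability the attribute is present.
        logits = 100.0 * image_feat @ text_feat.t()
        p_present = logits.softmax(dim=-1)[0, 0].item()
        print(f"{attr}: {p_present:.2f}")
```

In the paper, this kind of formulation is combined with multitask fine-tuning (attribute detection, localization, and classification) rather than used zero-shot as above.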
Related papers
- CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection [30.46562066023117]
We propose a novel method utilizing attributes in vision-language foundation models for incremental object detection.
Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes.
Through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage, significantly enhancing its scalability and adaptability.
arXiv Detail & Related papers (2024-10-08T08:36:12Z)
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) remains underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z)
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
Maximum-likelihood (MLE) training can lead the context vector to overfit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that alleviates such overfitting in few-shot learning applications.
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions [24.596929878045568]
We develop methods to train vision-language models (VLMs) with "bag-level" image-text supervision.
We use descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets.
Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance.
arXiv Detail & Related papers (2024-01-04T08:39:13Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- Structured Vision-Language Pretraining for Computational Cooking [54.0571416522547]
Vision-Language Pretraining and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks.
We propose to leverage these techniques for structured-text-based computational cuisine tasks.
arXiv Detail & Related papers (2022-12-08T13:37:17Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing [85.35582118010608]
Task-oriented semantic parsing is a critical component of virtual assistants.
Recent advances in deep learning have enabled several approaches to successfully parse more complex queries.
We propose a novel method that outperforms a supervised neural model at a 10-fold data reduction.
arXiv Detail & Related papers (2020-10-07T17:47:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.