Learning Customized Visual Models with Retrieval-Augmented Knowledge
- URL: http://arxiv.org/abs/2301.07094v1
- Date: Tue, 17 Jan 2023 18:59:06 GMT
- Title: Learning Customized Visual Models with Retrieval-Augmented Knowledge
- Authors: Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae
Lee, Chunyuan Li
- Abstract summary: We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by only training new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
- Score: 104.05456849611895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image-text contrastive learning models such as CLIP have demonstrated strong
task transfer ability. The high generality and usability of these visual models
are achieved via a web-scale data collection process to ensure broad concept
coverage, followed by expensive pre-training to feed all the knowledge into
model weights. Alternatively, we propose REACT, REtrieval-Augmented
CusTomization, a framework to acquire the relevant web knowledge to build
customized visual models for target domains. We retrieve the most relevant
image-text pairs (~3% of CLIP pre-training data) from the web-scale database as
external knowledge, and propose to customize the model by only training new
modularized blocks while freezing all the original weights. The effectiveness of
REACT is demonstrated via extensive experiments on classification, retrieval,
detection and segmentation tasks, including zero, few, and full-shot settings.
Particularly, on the zero-shot classification task, compared with CLIP, it
achieves up to 5.4% improvement on ImageNet and 3.7% on the ELEVATER benchmark
(20 datasets).
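
The abstract's recipe (retrieve the most relevant image-text pairs, then customize by training only new modularized blocks while the original weights stay frozen) can be pictured with a short, hedged sketch. Everything below, including the `GatedAdapter` design, the embedding dimension, and the loss details, is an illustrative assumption rather than the paper's released code; the retrieval step is omitted and assumed to have already produced the training pairs.

```python
# Minimal sketch (not the authors' implementation) of the customization step:
# all original encoder weights are frozen and only small, newly added blocks
# are trained on the retrieved image-text pairs with a contrastive loss.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAdapter(nn.Module):
    """Small trainable block appended to a frozen encoder (assumed design)."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as the identity

    def forward(self, x):
        return x + torch.tanh(self.gate) * self.ff(x)


class CustomizedModel(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.image_encoder, self.text_encoder = image_encoder, text_encoder
        for p in self.parameters():                 # freeze all original weights
            p.requires_grad_(False)
        self.image_adapter = GatedAdapter(dim)      # only the adapters are trained
        self.text_adapter = GatedAdapter(dim)

    def forward(self, images, tokens):
        img = F.normalize(self.image_adapter(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_adapter(self.text_encoder(tokens)), dim=-1)
        return img, txt


def clip_contrastive_loss(img, txt, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of retrieved image-text pairs."""
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Only the new blocks receive gradients:
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```

Because the gates are zero-initialized, the customized model starts out identical to the frozen base model, which is one common motivation for this adapter-style design.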
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification [1.7265013728931]
This paper introduces a novel framework for zero-shot learning (ZSL) to recognize new categories that are unseen during training.
We propose three strategies to enhance the model's ability to handle ZSL.
arXiv Detail & Related papers (2024-05-03T15:02:41Z)
- Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
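One simple way to instantiate a "lightweight detector on CLIP features" is a linear probe over frozen CLIP image embeddings; the logistic-regression head below is an assumption for illustration, since the summary does not specify the classifier.

```python
# Minimal sketch of a "CLIP features + linear probe" detector (assumed setup).

import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_detector(real_feats: np.ndarray, fake_feats: np.ndarray) -> LogisticRegression:
    """real_feats / fake_feats: (N, D) CLIP image embeddings of real and AI-generated images."""
    X = np.concatenate([real_feats, fake_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(fake_feats))])
    clf = LogisticRegression(max_iter=1000)
    return clf.fit(X, y)

# At test time, the probability that an image is AI-generated would be:
# p_fake = clf.predict_proba(clip_embedding[None, :])[0, 1]
```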
arXiv Detail & Related papers (2023-11-30T21:11:20Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
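The description above (a single-layer fusion transformer refining a frozen CLIP embedding with cross-modal retrieved neighbors) can be sketched roughly as follows; the shapes, the use of `nn.TransformerEncoderLayer`, and the choice to keep only the refined query token are assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a retrieval-enhanced refinement step: a single-layer fusion
# transformer updates a frozen-CLIP embedding using K cross-modal embeddings
# retrieved from a memory at inference time.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, query_emb, retrieved_embs):
        # query_emb: (B, D) frozen CLIP embedding of the query image (or text)
        # retrieved_embs: (B, K, D) cross-modal embeddings of its K nearest neighbors
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved_embs], dim=1)  # (B, 1+K, D)
        fused = self.layer(tokens)[:, 0]        # keep the refined query token
        return F.normalize(fused, dim=-1)
```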
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- One-Time Model Adaptation to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design [40.97593636235116]
We propose a new intra-client and inter-image attention (ICIIA) module into existing backbone recognition models.
In particular, given a target image from a certain client, ICIIA introduces multi-head self-attention to retrieve relevant images from the client's historical unlabeled images.
We evaluate ICIIA using 3 different recognition tasks with 9 backbone models over 5 representative datasets.
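A rough sketch of the attention step described above: the target image's backbone feature attends to features of the client's historical unlabeled images before classification. The cross-attention formulation, residual fusion, and dimensions are assumptions for illustration, not the paper's exact module.

```python
# Minimal sketch of an intra-client, inter-image attention module (assumed design).

import torch
import torch.nn as nn


class InterImageAttention(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feat, history_feats):
        # target_feat: (B, D) backbone feature of the current image
        # history_feats: (B, M, D) features of the client's past unlabeled images
        q = target_feat.unsqueeze(1)                          # (B, 1, D)
        attended, _ = self.attn(q, history_feats, history_feats)
        return self.norm(target_feat + attended.squeeze(1))   # residual fusion
```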
arXiv Detail & Related papers (2022-11-11T15:33:21Z)
- Reducing Overlearning through Disentangled Representations by Suppressing Unknown Tasks [8.517620051440005]
Existing deep learning approaches for learning visual features tend to overlearn and extract more information than what is required for the task at hand.
From a privacy preservation perspective, the input visual information is not protected from the model.
We propose a model-agnostic solution for reducing model overlearning by suppressing all the unknown tasks.
arXiv Detail & Related papers (2020-05-20T17:31:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.