PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language
Pre-training via Prompting
- URL: http://arxiv.org/abs/2307.07341v1
- Date: Fri, 14 Jul 2023 13:43:04 GMT
- Title: PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language
Pre-training via Prompting
- Authors: Zixin Guo, Tzu-Jui Julius Wang, Selen Pehlivan, Abduljalil Radman,
Jorma Laaksonen
- Abstract summary: We propose Prompts-in-The-Loop (PiTL) that prompts knowledge from large language models (LLMs) to describe images.
We create IN14K, a new VL dataset of 9M images and 1M descriptions of 14K categories from ImageNet21K with PiTL.
- Score: 8.784049710686629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language (VL) pre-training (VLP) has been shown to generalize VL
models well over a wide range of VL downstream tasks, especially cross-modal
retrieval. However, it hinges on a huge amount of image-text pairs, whose
curation is tedious and costly. In contrast, weakly-supervised VLP
(W-VLP) instead relies on object tags generated from images by a pre-trained
object detector (OD). Yet it still requires paired information, i.e., images
and object-level annotations, as supervision to train the OD.
To further reduce the amount of supervision, we propose Prompts-in-The-Loop
(PiTL) that prompts knowledge from large language models (LLMs) to describe
images. Concretely, given the category label of an image, e.g. refinery, the
knowledge extracted by LLMs, e.g. a refinery could be seen with large storage
tanks, pipework, and ..., is used as the language counterpart. This knowledge
supplements, for example, the common relations among the entities most likely
to appear in a scene. With PiTL, we create IN14K, a new VL dataset of 9M images
and 1M descriptions covering 14K categories from ImageNet21K. Empirically, the VL models
pre-trained with PiTL-generated pairs are strongly favored over other W-VLP
works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks, with less
supervision. The results reveal the effectiveness of PiTL-generated pairs for
VLP.
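The following is a minimal sketch of the PiTL idea as described in the abstract: prompt an LLM with a category label to obtain a scene-level description, then pair that description with every image of the category to form weakly aligned image-text pairs for pre-training. The prompt wording, the LLM interface, and all function names are assumptions for illustration; the paper's exact setup may differ.
```python
# Hypothetical sketch of Prompts-in-The-Loop (PiTL) pair generation.
# Not the authors' implementation; prompt template and LLM interface are assumed.
from typing import Callable, Dict, List, Tuple


def build_prompt(label: str) -> str:
    # Hypothetical prompt asking for entities/relations likely to co-occur
    # with the labeled concept (e.g. "refinery" -> storage tanks, pipework, ...).
    return (
        f"Describe what a scene containing a {label} typically looks like, "
        "mentioning objects and relations that are likely to appear."
    )


def generate_descriptions(labels: List[str],
                          llm: Callable[[str], str]) -> Dict[str, str]:
    # `llm` is any callable mapping a prompt string to generated text;
    # the specific LLM used in the paper is not assumed here.
    return {label: llm(build_prompt(label)) for label in labels}


def make_weak_pairs(images_by_label: Dict[str, List[str]],
                    descriptions: Dict[str, str]) -> List[Tuple[str, str]]:
    # Pair each image path with the LLM-generated description of its category,
    # yielding weakly aligned (image, text) pairs for VL pre-training.
    return [(img, descriptions[label])
            for label, imgs in images_by_label.items()
            for img in imgs]
```
Because descriptions are generated once per category rather than once per image, this scheme needs only class labels (no captions and no object-level annotations) as supervision, which is the reduction in curation cost the abstract emphasizes.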
Related papers
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification [12.053713356249695]
We propose LLaMP, Large Language Models as Prompt learners, which produces adaptive prompts for the CLIP text encoder.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification.
arXiv Detail & Related papers (2023-12-07T06:43:34Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP)
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- Weakly Supervised Vision-and-Language Pre-training with Relative Representations [76.63610760577214]
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
arXiv Detail & Related papers (2023-05-24T18:10:24Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision [6.8582563015193]
Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, achieve performance comparable to some models trained with aligned pairs on various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH).
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
arXiv Detail & Related papers (2022-10-24T20:30:55Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)