Learning by Hallucinating: Vision-Language Pre-training with Weak
Supervision
- URL: http://arxiv.org/abs/2210.13591v2
- Date: Thu, 27 Oct 2022 09:12:13 GMT
- Title: Learning by Hallucinating: Vision-Language Pre-training with Weak
Supervision
- Authors: Tzu-Jui Julius Wang, Jorma Laaksonen, Tomas Langer, Heikki Arponen,
and Tom E. Bishop
- Abstract summary: Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, achieve performance comparable to some models trained with aligned pairs on various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH).
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
- Score: 6.8582563015193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning
cross-modal alignment with little or no paired data, such as aligned images and
captions. Recent W-VLP methods, which pair visual features with object tags, achieve
performance comparable to some VLP models trained with aligned pairs on various V-L
downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We
argue that the learning of such a W-VLP model is curbed and biased by object tags of
limited semantics.
We address the lack of paired V-L data for model supervision with a novel
Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak
supervision as a W-VLP model, not requiring images paired with captions. WFH
generates visual hallucinations from texts, which are then paired with the
originally unpaired texts, allowing more diverse interactions across
modalities.
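A minimal sketch of the hallucination idea described above, written in PyTorch: text embeddings score a learned visual vocabulary (codebook), and the vocabulary entries, weighted by those scores, serve as hallucinated visual features paired with the otherwise unpaired text. The class name, the dimensions, and the simple dot-product lookup are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class FeatureHallucinator(nn.Module):
    """Hypothetical visual-vocabulary-based hallucinator (illustrative only)."""
    def __init__(self, text_dim=768, visual_dim=2048, vocab_size=1024):
        super().__init__()
        # Learned visual vocabulary: prototype visual features.
        self.visual_vocab = nn.Parameter(torch.randn(vocab_size, visual_dim))
        # Project text embeddings into the visual space to score vocabulary entries.
        self.query_proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_emb):                      # (batch, seq_len, text_dim)
        queries = self.query_proj(text_emb)           # (batch, seq_len, visual_dim)
        weights = (queries @ self.visual_vocab.t()).softmax(dim=-1)
        # Hallucinated visual features: convex combination of vocabulary entries.
        return weights @ self.visual_vocab            # (batch, seq_len, visual_dim)

# Usage: pair the hallucinated features with the source text for cross-modal pre-training.
hallucinator = FeatureHallucinator()
text_emb = torch.randn(4, 16, 768)                    # e.g. output of a text encoder
visual_hallucinations = hallucinator(text_emb)        # fed to the V-L encoder with the text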
Empirically, WFH consistently boosts prior W-VLP methods, e.g. U-VisualBERT
(U-VB), across a variety of V-L tasks such as XMR and Visual Question Answering.
Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on
image-to-text and text-to-image retrieval on two popular datasets, Flickr30K and
MSCOCO. Meanwhile, it gains at least 14.5% in cross-dataset generalization
tests on these XMR tasks. Moreover, in other V-L downstream tasks considered,
our WFH models are on par with models trained with paired V-L data, revealing
the utility of unpaired data. These results demonstrate greater generalization
of the proposed W-VLP model with WFH.
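For reference, the recall@{1,5,10} figures above follow the standard retrieval protocol: given an image-text similarity matrix whose ground-truth pairs lie on the diagonal, recall@K is the fraction of queries whose true match ranks in the top K. Below is a small NumPy sketch of this metric; the function name and the random similarity matrix are illustrative stand-ins for real model scores.

import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim[i, j] = similarity of query i to candidate j; sim[i, i] is the true pair.
    ranks = (-sim).argsort(axis=1)                    # candidates sorted by similarity
    target = np.arange(sim.shape[0])[:, None]
    hit_pos = (ranks == target).argmax(axis=1)        # rank of the ground-truth match
    return {k: float((hit_pos < k).mean()) for k in ks}

# Illustrative example: 100 text queries retrieving among 100 images.
rng = np.random.default_rng(0)
sim = rng.standard_normal((100, 100)) + 5 * np.eye(100)  # boost true pairs for the demo
print(recall_at_k(sim))                               # {1: ..., 5: ..., 10: ...}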
Related papers
- PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language
Pre-training via Prompting [8.784049710686629]
We propose Prompts-in-The-Loop (PiTL) that prompts knowledge from large language models (LLMs) to describe images.
We create IN14K, a new VL dataset of 9M images and 1M descriptions of 14K categories from ImageNet21K with PiTL.
arXiv Detail & Related papers (2023-07-14T13:43:04Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Probing Cross-modal Semantics Alignment Capability from the Textual Perspective [52.52870614418373]
Aligning cross-modal semantics is claimed to be one of the essential capabilities of vision and language pre-training models.
We propose a new probing method that is based on image captioning to first empirically study the cross-modal semantics alignment of V-L pre-training models.
arXiv Detail & Related papers (2022-10-18T02:55:58Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
- Contrastive Visual-Linguistic Pretraining [48.88553854384866]
Contrastive Visual-Linguistic Pretraining constructs a visual self-supervised loss built upon contrastive learning (a generic sketch of such an objective follows this list).
We evaluate it on several downstream tasks, including VQA, GQA and NLVR2.
arXiv Detail & Related papers (2020-07-26T14:26:18Z)
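The contrastive objective mentioned in the last entry typically takes an InfoNCE form over two views of the same instance. The sketch below is a generic illustration of that technique under common assumptions (cosine similarity, in-batch negatives, a fixed temperature), not that paper's exact formulation.

import torch
import torch.nn.functional as F

def infonce_loss(feat_a, feat_b, temperature=0.07):
    # feat_a and feat_b hold features of two views of the same instances;
    # matched rows are positives, all other in-batch pairs are negatives.
    a = F.normalize(feat_a, dim=-1)                   # (batch, dim)
    b = F.normalize(feat_b, dim=-1)                   # (batch, dim)
    logits = a @ b.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(a.size(0))                 # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with random features standing in for two encoded views.
loss = infonce_loss(torch.randn(8, 256), torch.randn(8, 256))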
This list is automatically generated from the titles and abstracts of the papers on this site.