Enhancing Vision-Language Pre-training with Rich Supervisions
- URL: http://arxiv.org/abs/2403.03346v1
- Date: Tue, 5 Mar 2024 22:14:58 GMT
- Title: Enhancing Vision-Language Pre-training with Rich Supervisions
- Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar
Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
- Abstract summary: We propose Strongly Supervised pre-training with ScreenShots (S4).
S4 is a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering.
We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks.
- Score: 60.269564094889446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel
pre-training paradigm for Vision-Language Models using data from large-scale
web screenshot rendering. Using web screenshots unlocks a treasure trove of
visual and textual cues that are not present when using image-text pairs. In S4,
we leverage the inherent tree-structured hierarchy of HTML elements and the
spatial localization to carefully design 10 pre-training tasks with large scale
annotated data. These tasks resemble downstream tasks across different domains
and the annotations are cheap to obtain. We demonstrate that, compared to
current screenshot pre-training objectives, our innovative pre-training method
significantly enhances the performance of image-to-text models on nine varied and
popular downstream tasks - up to 76.1% improvement on Table Detection and at
least 1% on Widget Captioning.
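As an illustration of how rendered web pages can yield the cheap, richly structured supervision the abstract describes, here is a minimal sketch (not the authors' S4 pipeline): it renders a URL, saves the screenshot, and walks the DOM to record each visible element's tag, depth in the HTML tree, bounding box, and text. Playwright, the JavaScript traversal, and the function names are assumptions made for illustration only.

```python
# Minimal sketch: derive (screenshot, element-level annotations) pairs from a
# rendered web page. Not the S4 pipeline; Playwright and this traversal are
# illustrative assumptions.
from playwright.sync_api import sync_playwright

JS_COLLECT = """
() => {
  const records = [];
  const walk = (node, depth) => {
    if (node.nodeType !== Node.ELEMENT_NODE) return;
    const rect = node.getBoundingClientRect();
    if (rect.width > 0 && rect.height > 0) {
      records.push({
        tag: node.tagName.toLowerCase(),
        depth: depth,                        // position in the HTML tree
        bbox: [rect.x, rect.y, rect.width, rect.height],
        text: (node.innerText || '').slice(0, 200),
      });
    }
    for (const child of node.children) walk(child, depth + 1);
  };
  walk(document.body, 0);
  return records;
}
"""

def render_with_annotations(url: str, out_png: str = "page.png"):
    """Render a page, save its screenshot, and return per-element annotations
    (tag, tree depth, bounding box, text) in viewport coordinates."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_png)          # viewport-sized screenshot
        annotations = page.evaluate(JS_COLLECT)
        browser.close()
    return annotations

if __name__ == "__main__":
    boxes = render_with_annotations("https://example.com")
    print(len(boxes), "annotated elements")
```

Pairs produced this way could in principle back objectives such as element localization or text grounding; the 10 pre-training tasks and the actual data pipeline are defined in the paper itself.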
Related papers
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.
We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z)
- Improving Language Understanding from Screenshots [56.40401271149811]
An emerging family of language models (LMs) can process both text and images within a single visual view.
Existing screenshot LMs lag behind text-only models on language understanding tasks.
We propose a novel Patch-and-Text Prediction objective, which masks and recovers both image patches of screenshots and text within screenshots.
arXiv Detail & Related papers (2024-02-21T19:01:03Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge [44.31783230767321]
We propose a plug-and-play framework, i.e. CapEnrich, to complement the generic image descriptions with more semantic details.
Our method significantly improves the descriptiveness and diversity of generated sentences for web images.
arXiv Detail & Related papers (2022-11-17T06:55:49Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework that enforces vision-language pre-training models to comprehend events and their associated argument (participant) roles.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [207.52609682812147]
We propose a new learning method, Oscar (Object-Semantics Aligned Pre-training).
It uses object tags detected in images as anchor points to significantly ease the learning of alignments.
We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks.
arXiv Detail & Related papers (2020-04-13T19:18:10Z)