Renaissance: Investigating the Pretraining of Vision-Language Encoders
- URL: http://arxiv.org/abs/2411.06657v1
- Date: Mon, 11 Nov 2024 01:44:54 GMT
- Title: Renaissance: Investigating the Pretraining of Vision-Language Encoders
- Authors: Clayton Fields, Casey Kennington
- Abstract summary: We seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis.
In our first set of experiments, we show that we can save significant compute, at no cost to downstream performance, by freezing large parts of vision-language models during pretraining.
In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model.
- Score: 0.6445605125467574
- License:
- Abstract: In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute, at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.
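The two design questions above lend themselves to a simple illustration. The sketch below is a minimal PyTorch example, not the Renaissance API; the backbone checkpoints, fusion head, and hyperparameters are assumptions. It shows the general pattern of pairing a pretrained vision backbone with a pretrained text backbone and freezing the vision tower so that only the remaining parameters are trained.

```python
# Minimal sketch of the freezing strategy (illustrative; not the Renaissance codebase).
import torch
import torch.nn as nn
from transformers import AutoModel

# Pretrained unimodal backbones (assumed checkpoints, chosen for illustration only).
vision = AutoModel.from_pretrained("google/vit-base-patch16-224")
text = AutoModel.from_pretrained("bert-base-uncased")

# Analogue of the first experiment: freeze the vision tower to cut pretraining compute.
for p in vision.parameters():
    p.requires_grad = False

# Hypothetical fusion head over pooled image and text features.
fusion = nn.Sequential(
    nn.Linear(vision.config.hidden_size + text.config.hidden_size, 512),
    nn.ReLU(),
    nn.Linear(512, 2),  # e.g. image-text matching logits
)

# Only parameters that still require gradients reach the optimizer.
trainable = [p for p in (*text.parameters(), *fusion.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Swapping which backbone is frozen, or which pretrained model the joint encoder is initialized from, corresponds to the paper's second question of basing a VL transformer on a vision model versus a text model.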
Related papers
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging (a generic weight-interpolation sketch appears after this list).
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models [10.272476734387977]
We introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks.
We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities.
VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks.
arXiv Detail & Related papers (2024-06-19T09:07:31Z)
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memorized with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z)
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks [27.450456238980433]
We propose a new general foundation model, X-FM (the X-Foundation Model).
X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method.
X-FM can significantly outperform existing general foundation models and perform better than or comparable to existing foundation models specifically for language, vision, or vision-language understanding.
arXiv Detail & Related papers (2023-01-12T15:03:05Z)
- X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks [38.05496300873095]
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
We propose to learn multi-grained vision language alignments by a unified pre-training framework.
X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
arXiv Detail & Related papers (2022-11-22T16:48:01Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
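As noted in the ReVLA entry above, that work builds on model merging. The sketch below is a generic illustration of model merging via weight-space interpolation, not ReVLA's actual gradual backbone reversal procedure; the checkpoint names and the merging coefficient are assumptions.

```python
# Generic weight-space merging sketch (illustrative; not the ReVLA implementation).
import torch
from transformers import AutoModel

# Stand-ins: in practice these would be a drifted (fine-tuned) backbone
# and the original pretrained weights it started from.
finetuned = AutoModel.from_pretrained("google/vit-base-patch16-224")
pretrained = AutoModel.from_pretrained("google/vit-base-patch16-224")

def merge_state_dicts(ft_model, pt_model, alpha):
    """Return a state dict equal to alpha * pretrained + (1 - alpha) * fine-tuned."""
    ft, pt = ft_model.state_dict(), pt_model.state_dict()
    return {name: alpha * pt[name] + (1.0 - alpha) * ft[name] for name in ft}

# A "gradual" reversal could increase alpha over successive training stages.
alpha = 0.5  # assumed merging coefficient
finetuned.load_state_dict(merge_state_dicts(finetuned, pretrained, alpha))
```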
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.