Toward Building General Foundation Models for Language, Vision, and
Vision-Language Understanding Tasks
- URL: http://arxiv.org/abs/2301.05065v2
- Date: Tue, 17 Oct 2023 16:11:36 GMT
- Title: Toward Building General Foundation Models for Language, Vision, and
Vision-Language Understanding Tasks
- Authors: Xinsong Zhang, Yan Zeng, Jipeng Zhang, Hang Li
- Abstract summary: We propose a new general foundation model, X-FM (the X-Foundation Model).
X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method.
X-FM can significantly outperform existing general foundation models and perform better than, or comparably to, foundation models designed specifically for language, vision, or vision-language understanding.
- Score: 27.450456238980433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models or pre-trained models have substantially improved the
performance of various language, vision, and vision-language understanding
tasks. However, existing foundation models each perform best on only one type of
task, namely language, vision, or vision-language. It remains an open question
whether it is possible to construct a foundation model that performs best on all
of these understanding tasks, which we call a general foundation model.
In this paper, we propose a new general foundation model, X-FM (the
X-Foundation Model). X-FM has one language encoder, one vision encoder, and one
fusion encoder, as well as a new training method. The training method includes
two new techniques for learning X-FM from text, image, and image-text pair
data. One is to stop gradients from the vision-language training when learning
the language encoder. The other is to leverage the vision-language training to
guide the learning of the vision encoder. Extensive experiments on benchmark
datasets show that X-FM can significantly outperform existing general foundation
models and perform better than, or comparably to, foundation models designed
specifically for language, vision, or vision-language understanding.
Code and pre-trained models are released at
https://github.com/zhangxinsong-nlp/XFM.
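As a concrete illustration of the two training techniques above, the following is a minimal, self-contained PyTorch sketch, not the authors' released implementation: toy linear layers stand in for the three encoders, the vision-language and guidance losses are placeholders, and the 0.1 guidance weight is an invented assumption. It only shows how detaching the text features blocks vision-language gradients from the language encoder while the vision encoder still receives them.

```python
# Hypothetical sketch of the two X-FM training techniques; toy modules only.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 16

# Toy stand-ins for the three encoders (the real model uses transformers).
language_encoder = nn.Linear(dim, dim)
vision_encoder = nn.Linear(dim, dim)
fusion_encoder = nn.Linear(2 * dim, dim)

text = torch.randn(4, dim)     # a batch of text features
image = torch.randn(4, dim)    # a batch of image features

text_feats = language_encoder(text)
image_feats = vision_encoder(image)

# Technique 1: stop gradients from the vision-language (VL) training before
# they reach the language encoder, so VL losses do not update it.
fused = fusion_encoder(torch.cat([text_feats.detach(), image_feats], dim=-1))

# Placeholder VL objective, standing in for losses such as image-text matching.
vl_loss = fused.pow(2).mean()

# Technique 2: let the VL training guide the vision encoder: the vision branch
# stays attached (no detach above), and an extra guidance term pulls the vision
# features toward the fused representation. The MSE form and the 0.1 weight
# are illustrative assumptions, not the paper's exact formulation.
guidance_loss = nn.functional.mse_loss(image_feats, fused.detach())

(vl_loss + 0.1 * guidance_loss).backward()

print(language_encoder.weight.grad)                 # None: no VL gradient
print(vision_encoder.weight.grad.abs().sum() > 0)   # tensor(True): guided
```

In actual training, the language and vision encoders would also receive their own unimodal objectives on text and image data (e.g., masked language and image modeling), which are omitted here for brevity.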
Related papers
- Renaissance: Investigating the Pretraining of Vision-Language Encoders [0.6445605125467574]
We seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis.
In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining.
In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model.
arXiv Detail & Related papers (2024-11-11T01:44:54Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- Is Multimodal Vision Supervision Beneficial to Language? [2.216702991322677]
Vision (image and video) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks.
We compare the performance of language representations of stand-alone text encoders of these models to the language representations of text encoders learnt through vision supervision.
arXiv Detail & Related papers (2023-02-10T02:22:44Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks [38.05496300873095]
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
We propose to learn multi-grained vision language alignments by a unified pre-training framework.
X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
arXiv Detail & Related papers (2022-11-22T16:48:01Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceive diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.