X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
- URL: http://arxiv.org/abs/2211.12402v2
- Date: Sun, 30 Jul 2023 13:20:13 GMT
- Title: X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
- Authors: Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang,
Wangchunshu Zhou
- Abstract summary: Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
We propose to learn multi-grained vision language alignments by a unified pre-training framework.
X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
- Score: 38.05496300873095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision language pre-training aims to learn alignments between vision and
language from a large amount of data. Most existing methods only learn
image-text alignments. Some others utilize pre-trained object detectors to
leverage vision language alignments at the object level. In this paper, we
propose to learn multi-grained vision language alignments by a unified
pre-training framework that learns multi-grained aligning and multi-grained
localization simultaneously. Based on it, we present X$^2$-VLM, an all-in-one
model with a flexible modular architecture, in which we further unify
image-text pre-training and video-text pre-training in one model. X$^2$-VLM is
able to learn unlimited visual concepts associated with diverse text
descriptions. Experimental results show that X$^2$-VLM performs the best at both base
and large scale for image-text and video-text tasks, striking a good
trade-off between performance and model scale. Moreover, we show that the
modular design of X$^2$-VLM yields high transferability, allowing it to be
utilized in any language or domain. For example, by simply replacing the text
encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual
multi-modal pre-trained models without any multilingual pre-training. The code
and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
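The abstract's claim that the text encoder can be swapped (e.g., for XLM-R) follows from the modular architecture. Below is a minimal, hypothetical PyTorch sketch of that idea; it is not the official X$^2$-VLM implementation (see the repository above), and the class name `ModularVLM`, the stub encoders, and all dimensions are illustrative assumptions.

```python
# Illustrative sketch only (not the official X^2-VLM code; see the GitHub repo above).
# It shows the modular idea from the abstract: vision, text, and fusion encoders are
# separate components, so the text encoder can be swapped (e.g., for XLM-R) without
# touching the rest of the model. All module names and sizes here are hypothetical.
import torch
import torch.nn as nn


class ModularVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., a ViT producing patch features
        self.text_encoder = text_encoder       # e.g., BERT, or XLM-R for multilingual transfer
        # Cross-attention fusion layer that aligns text tokens with visual features.
        self.fusion = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        v = self.vision_encoder(image_feats)   # (B, num_patches, hidden_dim)
        t = self.text_encoder(text_feats)      # (B, num_tokens, hidden_dim)
        # Text attends to vision features; the fused output feeds downstream task heads.
        return self.fusion(tgt=t, memory=v)


# Stand-in encoders; in practice these would be pre-trained transformers.
vision_stub = nn.Linear(768, 768)
english_text_stub = nn.Linear(768, 768)
multilingual_text_stub = nn.Linear(768, 768)  # plays the role of XLM-R in the abstract's example

model = ModularVLM(vision_stub, english_text_stub)
# "Simply replacing the text encoder" amounts to swapping one module:
model.text_encoder = multilingual_text_stub

images = torch.randn(2, 196, 768)   # dummy patch features
texts = torch.randn(2, 32, 768)     # dummy token features
fused = model(images, texts)
print(fused.shape)  # torch.Size([2, 32, 768])
```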
Related papers
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic
Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to 'understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z)
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks [27.450456238980433]
We propose a new general foundation model, X-FM (the X-Foundation Model).
X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method.
X-FM can significantly outperform existing general foundation models and perform better than, or comparably to, existing foundation models designed specifically for language, vision, or vision-language understanding.
arXiv Detail & Related papers (2023-01-12T15:03:05Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
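The UniVLC loss mentioned in the OmniVL entry combines image-text, video-text, image-label, and video-label pairs under one contrastive objective. The sketch below is not OmniVL's exact formulation; it assumes a generic symmetric InfoNCE loss and that labels are converted to text prompts and videos to pooled frame embeddings, so every data type reduces to a (visual, text) embedding pair.

```python
# Hedged sketch of a generic symmetric contrastive loss in the spirit of a unified
# vision-language contrastive objective (not OmniVL's exact UniVLC formulation).
# Assumption: image-text, video-text, image-label, and video-label pairs have all been
# mapped to (visual embedding, text embedding) pairs -- labels as text prompts, videos
# as temporally pooled frame features -- so one InfoNCE-style loss covers them all.
import torch
import torch.nn.functional as F


def unified_contrastive_loss(visual_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (visual, text) embedding pairs."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal; contrast against all other pairs in the batch.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2


# Dummy batch mixing the four data types after projection into a shared embedding space.
visual = torch.randn(8, 256)   # image / pooled-video embeddings
text = torch.randn(8, 256)     # caption / label-prompt embeddings
print(unified_contrastive_loss(visual, text).item())
```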