InternVL: Scaling up Vision Foundation Models and Aligning for Generic
Visual-Linguistic Tasks
- URL: http://arxiv.org/abs/2312.14238v3
- Date: Mon, 15 Jan 2024 15:23:55 GMT
- Title: InternVL: Scaling up Vision Foundation Models and Aligning for Generic
Visual-Linguistic Tasks
- Authors: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing,
Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu,
Yu Qiao, Jifeng Dai
- Abstract summary: We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to 32 generic visual-linguistic benchmarks and achieves state-of-the-art performance on them.
It has powerful visual capabilities and can be a good alternative to ViT-22B.
- Score: 92.03764152132315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multimodal AGI systems. However, the progress in vision and
vision-language foundation models, which are also critical elements of
multi-modal AGI, has not kept pace with LLMs. In this work, we design a
large-scale vision-language foundation model (InternVL), which scales up the
vision foundation model to 6 billion parameters and progressively aligns it
with the LLM, using web-scale image-text data from various sources. This model
can be broadly applied to 32 generic visual-linguistic benchmarks and achieves
state-of-the-art performance on them, covering visual perception tasks such as
image-level and pixel-level recognition, vision-language tasks such as zero-shot
image/video classification and zero-shot image/video-text retrieval, and linking
with LLMs to create multi-modal dialogue systems. It has powerful visual
capabilities and can be a good alternative to ViT-22B. We hope that our
research could contribute to the development of multi-modal large models. Code
and models are available at https://github.com/OpenGVLab/InternVL.
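To make the zero-shot usage pattern in the abstract concrete, below is a minimal sketch of CLIP-style zero-shot classification: images and class-name prompts are embedded into a shared space and compared by cosine similarity. The ToyImageEncoder/ToyTextEncoder modules, dimensions, and dummy inputs are illustrative placeholders, not the InternVL API; the actual pretrained encoders and loading code are in the repository linked above.

```python
import torch
import torch.nn.functional as F


class ToyImageEncoder(torch.nn.Module):
    """Placeholder for a large vision encoder (InternVL scales this to ~6B parameters)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 224 * 224, embed_dim)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, 3, 224, 224) -> (batch, embed_dim)
        return self.proj(pixels.flatten(1))


class ToyTextEncoder(torch.nn.Module):
    """Placeholder for the language branch aligned with the vision encoder."""

    def __init__(self, vocab_size: int = 1000, embed_dim: int = 512):
        super().__init__()
        self.embed = torch.nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (num_prompts, seq_len) -> (num_prompts, embed_dim)
        return self.embed(token_ids)


@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, pixels, prompt_token_ids):
    """Score each image against each class prompt via cosine similarity."""
    img = F.normalize(image_encoder(pixels), dim=-1)            # (B, D)
    txt = F.normalize(text_encoder(prompt_token_ids), dim=-1)   # (C, D)
    logits = 100.0 * img @ txt.T                                # temperature-scaled similarities
    return logits.softmax(dim=-1)                               # (B, C) class probabilities


if __name__ == "__main__":
    torch.manual_seed(0)
    probs = zero_shot_classify(
        ToyImageEncoder(), ToyTextEncoder(),
        pixels=torch.rand(2, 3, 224, 224),                # two dummy images
        prompt_token_ids=torch.randint(0, 1000, (5, 8)),  # five dummy class prompts
    )
    print(probs.shape)  # torch.Size([2, 5]); each row sums to 1
```

Zero-shot image-text retrieval follows the same pattern: the similarity matrix is reused to rank items in one modality against a query from the other, with the released InternVL checkpoints standing in for the toy modules above.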
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Libra: Building Decoupled Vision System on Large Language Models [63.28088885230901]
We introduce Libra, a prototype model with a decoupled vision system built on a large language model (LLM).
The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension.
arXiv Detail & Related papers (2024-05-16T14:34:44Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) by integrating region-level position information.
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)