VisionLLM: Large Language Model is also an Open-Ended Decoder for
Vision-Centric Tasks
- URL: http://arxiv.org/abs/2305.11175v2
- Date: Thu, 25 May 2023 15:02:07 GMT
- Title: VisionLLM: Large Language Model is also an Open-Ended Decoder for
Vision-Centric Tasks
- Authors: Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang
Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai
- Abstract summary: VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
- Score: 81.32968995346775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have notably accelerated progress towards
artificial general intelligence (AGI), with their impressive zero-shot capacity
for user-tailored tasks, endowing them with immense potential across a range of
applications. However, in the field of computer vision, despite the
availability of numerous powerful vision foundation models (VFMs), they are
still restricted to tasks in a pre-defined form, struggling to match the
open-ended task capabilities of LLMs. In this work, we present an LLM-based
framework for vision-centric tasks, termed VisionLLM. This framework provides a
unified perspective for vision and language tasks by treating images as a
foreign language and aligning vision-centric tasks with language tasks that can
be flexibly defined and managed using language instructions. An LLM-based
decoder can then make appropriate predictions based on these instructions for
open-ended tasks. Extensive experiments show that the proposed VisionLLM can
achieve different levels of task customization through language instructions,
from fine-grained object-level to coarse-grained task-level customization, all
with good results. It's noteworthy that, with a generalist LLM-based framework,
our model can achieve over 60\% mAP on COCO, on par with detection-specific
models. We hope this model can set a new baseline for generalist vision and
language models. The demo shall be released based on
https://github.com/OpenGVLab/InternGPT. The code shall be released at
https://github.com/OpenGVLab/VisionLLM.
Related papers
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM)
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - LM4LV: A Frozen Large Language Model for Low-level Vision Tasks [25.3601306724822]
$textbfLM4LV$ is a framework that enables a large language model to solve a range of low-level vision tasks without any multi-modal data or prior.
This showcases the LLM's strong potential in low-level vision and bridges the gap between MLLMs and low-level vision tasks.
arXiv Detail & Related papers (2024-05-24T17:25:00Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic
Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.