VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
- URL: http://arxiv.org/abs/2403.00522v2
- Date: Mon, 8 Jul 2024 02:48:27 GMT
- Title: VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
- Authors: Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen
- Abstract summary: We unveil a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, tailored to processing 2D images.
VisionLLaMA is a unified and generic modelling framework for solving most vision tasks.
- Score: 60.22144823791902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are built on top of a transformer-based architecture to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms across a good portion of downstream tasks in image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.
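As a rough illustration of what a LLaMA-like block looks like when applied to image patch tokens rather than text tokens, the sketch below combines the standard LLaMA ingredients (RMSNorm, SwiGLU feed-forward, rotary position embedding) with bidirectional attention over a flattened patch grid. This is a minimal, hedged sketch: the 1D rotary cache, class names, and hyperparameters are illustrative simplifications (the paper describes a 2D rotary variant), not the released VisionLLaMA implementation linked in the abstract.

```python
# Minimal sketch of a LLaMA-style transformer block on image patch tokens.
# Illustrative only; not taken from the VisionLLaMA repository.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x, cos, sin):
    # x: (batch, heads, tokens, head_dim); cos/sin: (tokens, head_dim)
    return x * cos + rotate_half(x) * sin


def rope_cache(num_tokens: int, head_dim: int, base: float = 10000.0):
    # Standard 1D rotary cache over the flattened patch sequence
    # (a simplification; the paper describes a 2D variant).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(num_tokens).float()
    freqs = torch.outer(t, inv_freq)            # (tokens, head_dim / 2)
    emb = torch.cat((freqs, freqs), dim=-1)     # (tokens, head_dim)
    return emb.cos(), emb.sin()


class LLaMAStyleBlock(nn.Module):
    def __init__(self, dim: int = 384, heads: int = 6, mlp_ratio: float = 8 / 3):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = RMSNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.w1 = nn.Linear(dim, hidden, bias=False)   # SwiGLU gate
        self.w2 = nn.Linear(dim, hidden, bias=False)   # SwiGLU value
        self.w3 = nn.Linear(hidden, dim, bias=False)   # SwiGLU output

    def forward(self, x, cos, sin):
        b, n, d = x.shape
        # Bidirectional self-attention with rotary embeddings on q and k.
        qkv = self.qkv(self.norm1(x)).view(b, n, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (b, heads, n, head_dim)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        attn = F.scaled_dot_product_attention(q, k, v)  # no causal mask for vision
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, d))
        # SwiGLU feed-forward, as in LLaMA.
        h = self.norm2(x)
        return x + self.w3(F.silu(self.w1(h)) * self.w2(h))


# Example: 196 patch tokens (a 14x14 grid) of dimension 384.
tokens = torch.randn(2, 196, 384)
cos, sin = rope_cache(num_tokens=196, head_dim=384 // 6)
out = LLaMAStyleBlock()(tokens, cos, sin)
print(out.shape)  # torch.Size([2, 196, 384])
```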
Related papers
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation [62.747751204215916]
We propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation.
PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs.
This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks.
arXiv Detail & Related papers (2024-10-17T17:59:57Z) - Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue.
We develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior (a sketch of one possible mask schedule appears after this list).
arXiv Detail & Related papers (2024-04-10T06:30:08Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - Masked Vision-Language Transformer in Fashion [85.6143169850834]
We propose a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation.
MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models.
More importantly, MVLT can easily generalize to various matching and generative tasks.
arXiv Detail & Related papers (2022-10-27T01:44:08Z) - Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN [38.87225202482656]
Masked image modeling, an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers.
We propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way.
arXiv Detail & Related papers (2022-05-27T12:42:02Z) - On Vision Features in Multimodal Machine Translation [34.41229863267296]
We develop a selective attention model to study the patch-level contribution of an image in multimodal machine translation.
Our results suggest the need of carefully examining MMT models, especially when current benchmarks are small-scale and biased.
arXiv Detail & Related papers (2022-03-17T08:51:09Z)
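The soft mask strategy described in the "Adapting LLaMA Decoder to Vision Transformer" entry above can be read as scheduling the attention bias from fully bidirectional to causal over early training. The sketch below shows one such schedule in PyTorch; the linear warm-up and the penalty constant are assumptions for illustration, not that paper's exact formulation.

```python
# One plausible reading of the "soft mask" idea: future positions receive an
# attention penalty that ramps from 0 (fully bidirectional) toward a large
# negative value (effectively causal) over the first warm-up steps.
# The linear schedule and penalty scale are assumptions for illustration.
import torch


def soft_causal_bias(num_tokens: int, step: int, warmup_steps: int,
                     max_penalty: float = 1e4) -> torch.Tensor:
    """Additive attention bias of shape (num_tokens, num_tokens)."""
    alpha = min(step / max(warmup_steps, 1), 1.0)                 # 0 -> 1 during warm-up
    future = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1)
    # After warm-up, -max_penalty is large enough to act as a hard causal mask.
    return future * (-alpha * max_penalty)


# Usage: add the bias to the attention scores before the softmax.
scores = torch.randn(1, 6, 196, 196)                              # (batch, heads, tokens, tokens)
bias = soft_causal_bias(num_tokens=196, step=500, warmup_steps=2000)
attn = torch.softmax(scores + bias, dim=-1)
```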