LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
- URL: http://arxiv.org/abs/2403.11703v1
- Date: Mon, 18 Mar 2024 12:04:11 GMT
- Title: LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
- Authors: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
- Abstract summary: We present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks.
- Score: 119.24323184581974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
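The image modularization step in the abstract (dividing a native-resolution image into variable-sized slices) can be illustrated with a minimal sketch. It assumes a square 336x336 encoder input, as in the LLaVA-1.5 336x336 setting the abstract mentions, and a simplified selection rule: estimate how many slices are needed from the area ratio, then pick the grid whose slices are closest to square. The function names and the exact rule are illustrative assumptions, not the released LLaVA-UHD code.

```python
import math

ENCODER_SIDE = 336  # assumed square encoder input (LLaVA-1.5 336x336 setting)


def ideal_slice_count(width, height, enc_side=ENCODER_SIDE):
    """Number of encoder-sized slices needed to cover the image area."""
    return math.ceil((width * height) / (enc_side * enc_side))


def factor_grids(n):
    """All (columns, rows) pairs whose product is exactly n."""
    return [(cols, n // cols) for cols in range(1, n + 1) if n % cols == 0]


def choose_grid(width, height, enc_side=ENCODER_SIDE):
    """Pick a (columns, rows) grid whose slices are as close to square as possible.

    Assumption: grids with n-1, n and n+1 slices are all considered, so that
    a prime n (e.g. 7) does not force a 1xN strip layout.
    """
    n = ideal_slice_count(width, height, enc_side)
    candidates = []
    for k in (n - 1, n, n + 1):
        if k >= 1:
            candidates.extend(factor_grids(k))
    # A slice is square when cols / rows equals the image's width / height,
    # so compare the two ratios in log space.
    target = math.log(width / height)
    return min(candidates, key=lambda g: abs(target - math.log(g[0] / g[1])))


if __name__ == "__main__":
    w, h = 672, 1088  # the high-resolution example from the abstract
    print(round(w * h / ENCODER_SIDE**2, 2))  # area ratio vs. 336x336 -> 6.48
    print(ideal_slice_count(w, h))            # -> 7
    print(choose_grid(w, h))                  # -> (2, 3)
```

Under these assumptions, a 672x1088 image has about 6.5 times the area of a 336x336 input, which is consistent with the "6 times larger" figure in the abstract, and it is split into 2 columns by 3 rows of roughly 336x363 slices.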
Related papers
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models [70.2997884478129]
We introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs.
We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs.
(2024-07-10T17:59:43Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
(2024-05-24T17:34:15Z)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
(2024-04-25T19:29:55Z)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM).
It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
(2024-04-25T17:59:19Z)
- On Speculative Decoding for Multimodal Large Language Models [11.245862832561176]
Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone.
We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B.
(2024-04-13T00:02:36Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
(2024-03-11T14:35:32Z)
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [28.39885771124003]
We introduce Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos.
(2023-11-16T10:59:44Z)
- EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distill-then-prune framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% of the teacher model's performance and accelerates its inference speed by 2.2x.
(2022-10-14T13:26:41Z)