Related papers: LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

URL: http://arxiv.org/abs/2403.11703v1
Date: Mon, 18 Mar 2024 12:04:11 GMT
Title: LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Authors: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang
Abstract summary: We present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks.
Score: 119.24323184581974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
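
The "6 times larger" figure and the idea of slicing a native-resolution image can be checked with a little arithmetic. The sketch below is illustrative only: the function name `slice_grid`, the 336-pixel base size as a parameter, the choice to also try neighbouring slice counts, and the log-aspect-ratio criterion are all assumptions made for this example, not details stated in the abstract.

```python
import math


def slice_grid(width, height, base=336):
    """Hypothetical helper: pick a (cols, rows) grid that matches the image shape."""
    # Ideal slice count: image area over one encoder input's area, rounded up.
    ideal = math.ceil((width * height) / (base * base))
    best = None
    # Assumption: also try one slice fewer and one more, so that a prime
    # count (e.g. 7) does not force a 1 x 7 strip.
    for count in (ideal - 1, ideal, ideal + 1):
        if count < 1:
            continue
        for cols in range(1, count + 1):
            if count % cols:
                continue
            rows = count // cols
            # Gap between image and grid aspect ratios, measured in log space.
            err = abs(math.log(width / height) - math.log(cols / rows))
            if best is None or err < best[0]:
                best = (err, cols, rows)
    return best[1], best[2]


# Area of a 672x1088 image relative to a single 336x336 input.
area_ratio = (672 * 1088) / (336 * 336)
print(round(area_ratio, 2))
print(slice_grid(672, 1088))
```

Under these assumptions, a 672x1088 image covers roughly 6.5 times the area of a 336x336 input, and the sketch picks a grid of 2 columns by 3 rows.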

Related papers

AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model [40.488271586857884]
AndesVL is a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning framework to facilitate efficient task adaptation and model compression. We achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips.
arXiv Detail & Related papers (2025-10-13T15:04:38Z)
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token [43.13591088045511]
We introduce LLaVA-Mini, an efficient large multimodal model with minimal vision tokens. LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
arXiv Detail & Related papers (2025-01-07T16:03:14Z)
FastVLM: Efficient Vision Encoding for Vision Language Models [22.41836943083826]
We introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
arXiv Detail & Related papers (2024-12-17T20:09:55Z)
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information [4.33464415015353]
We propose Dynamic Feature Map Reduction (DFMR) based on LLaVA-1.5 to address the challenge of visual token overload. DFMR dynamically compresses the visual tokens, freeing up token capacity. Our experimental results demonstrate that integrating DFMR into LLaVA-1.5 significantly improves the performance of LLaVA in varied visual token lengths.
arXiv Detail & Related papers (2024-12-11T20:46:06Z)
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. We show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference.
arXiv Detail & Related papers (2024-09-20T10:50:21Z)
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models [70.2997884478129]
We introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs.
arXiv Detail & Related papers (2024-07-10T17:59:43Z)
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM. ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM). It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z)
On Speculative Decoding for Multimodal Large Language Models [11.245862832561176]
Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B.
arXiv Detail & Related papers (2024-04-13T00:02:36Z)
Matryoshka Diffusion Models [38.26966802461602]
Diffusion models are the de facto approach for generating high-quality images and videos. We introduce Matryoshka Diffusion Models, an end-to-end framework for high-resolution image and video synthesis. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications.
arXiv Detail & Related papers (2023-10-23T17:20:01Z)
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling then pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers. EfficientVLM retains 98.4% performance of the teacher model and accelerates its inference speed by 2.2x.
arXiv Detail & Related papers (2022-10-14T13:26:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.