Related papers: LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

URL: http://arxiv.org/abs/2412.13871v1
Date: Wed, 18 Dec 2024 14:07:46 GMT
Title: LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
Authors: Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
Abstract summary: We present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window transformer. As a vision-language projector, Hiwin transformer comprises two primary modules. Extensive experiments demonstrate that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular benchmarks.
Score: 109.61952368100756
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In multimodal large language models (MLLMs), vision transformers (ViTs) are widely employed for visual encoding. However, their performance in solving universal MLLM tasks is not satisfactory. We attribute it to a lack of information from diverse visual levels, impeding alignment with the various semantic granularity required for language generation. To address this issue, we present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window transformer that enables capturing diverse visual granularity by constructing and integrating a high-resolution feature pyramid. As a vision-language projector, Hiwin transformer comprises two primary modules: (i) an inverse feature pyramid, constructed by a ViT-derived feature up-sampling process utilizing high-frequency details from an image pyramid, and (ii) hierarchical window attention, focusing on a set of key sampling features within cross-scale windows to condense multi-level feature maps. Extensive experiments demonstrate that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular benchmarks. Notably, our design brings an average boost of 3.7% across 14 benchmarks compared with the baseline method, 9.3% on DocVQA for instance. We make all the data, model checkpoint, and code publicly available to facilitate future research.
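
The two modules named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: nearest-neighbour up-sampling stands in for the learned, image-guided up-sampling, and mean pooling stands in for the attention over key sampling features. All function names below are hypothetical. What it shows is only the data flow: a pyramid of feature maps at increasing resolution, and one condensed token per spatial window that gathers a value from every pyramid level.

```python
from typing import List

Grid = List[List[float]]  # single-channel feature map stored as rows


def upsample_nearest(fmap: Grid, factor: int) -> Grid:
    """Repeat every value `factor` times along both axes.

    Assumption: a stand-in for the paper's learned up-sampling.
    """
    return [
        [v for v in row for _ in range(factor)]
        for row in fmap
        for _ in range(factor)
    ]


def build_pyramid(base: Grid, levels: int) -> List[Grid]:
    """Level k has 2**k times the base resolution per side."""
    return [upsample_nearest(base, 2 ** k) for k in range(levels)]


def window_mean(fmap: Grid, r0: int, r1: int, c0: int, c1: int) -> float:
    """Average the values inside rows r0..r1-1 and columns c0..c1-1."""
    vals = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(vals) / len(vals)


def condense(pyramid: List[Grid], grid: int) -> List[List[float]]:
    """Split every level into the same grid x grid windows and emit one
    token per window holding one pooled value per pyramid level.

    Assumption: mean pooling replaces the attention used in the paper.
    """
    tokens = []
    for gy in range(grid):
        for gx in range(grid):
            token = []
            for fmap in pyramid:
                wh = len(fmap) // grid      # window height at this level
                ww = len(fmap[0]) // grid   # window width at this level
                token.append(
                    window_mean(fmap, gy * wh, (gy + 1) * wh,
                                gx * ww, (gx + 1) * ww)
                )
            tokens.append(token)
    return tokens


base = [[1, 2], [3, 4]]
pyramid = build_pyramid(base, levels=3)
tokens = condense(pyramid, grid=2)
```

Here the number of tokens depends only on the window grid (4 for a 2 x 2 grid), not on the resolution of the finest pyramid level, which is the sense in which cross-scale windows condense multi-level feature maps.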

Related papers

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs [4.478610052538001]
We propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling.
arXiv Detail & Related papers (2025-07-01T07:20:11Z)
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs [61.903626952650605]
Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. We propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts.
arXiv Detail & Related papers (2025-06-13T07:16:41Z)
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer [68.71557348281007]
This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM). Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs.
arXiv Detail & Related papers (2025-04-14T17:50:20Z)
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding [36.376220619032225]
REF-VLM is an end-to-end framework for unified training of various visual decoding tasks. We construct a large-scale multi-task dataset containing over 100 million multimodal dialogue samples. REF-VLM outperforms other MLLMs across a variety of standard benchmarks.
arXiv Detail & Related papers (2025-03-10T14:59:14Z)
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z)
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment [39.870809905905325]
We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
arXiv Detail & Related papers (2024-10-08T11:41:55Z)
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments [12.428873051106702]
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks. LLMs struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME, a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks.
arXiv Detail & Related papers (2024-08-20T17:57:46Z)
Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models [70.25499865569353]
We introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert. Our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.
arXiv Detail & Related papers (2024-03-20T09:42:43Z)
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment [23.486297020327257]
The current vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
arXiv Detail & Related papers (2023-07-07T03:51:21Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.