Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models
- URL: http://arxiv.org/abs/2501.14276v1
- Date: Fri, 24 Jan 2025 06:42:06 GMT
- Title: Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models
- Authors: Yuxuan Liang, Xu Li, Xiaolei Chen, Haotian Chen, Yi Zheng, Chenghang Lai, Bin Li, Xiangyang Xue,
- Abstract summary: Sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability.<n>Global Semantic-guided Weight Allocator (GSWA) module allocates weights to sub-images based on their relative information density.<n>SleighVL, a lightweight yet high-performing model, outperforms models with comparable parameters and remains competitive with larger models.
- Score: 50.98559225639266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the demand for high-resolution image processing in Large Vision-Language Models (LVLMs) grows, sub-image partitioning has become a popular approach for mitigating visual information loss associated with fixed-resolution processing. However, existing partitioning methods uniformly process sub-images, resulting in suboptimal image understanding. In this work, we reveal that the sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. Therefore, we propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA into the InternVL2-2B framework to create SleighVL, a lightweight yet high-performing model. Extensive experiments demonstrate that SleighVL outperforms models with comparable parameters and remains competitive with larger models. Our work provides a promising direction for more efficient and contextually aware high-resolution image processing in LVLMs, advancing multimodal system development.
Related papers
- ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction [55.21514454560188]
Unified multimodal models significantly improve visual generation by combining vision-grained models (VLMs) with diffusion models.<n>Existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation difference.<n>We propose textbfParaUni, which extracts features from variants VLM's layers in a textbfParallel way for comprehensive information interaction.
arXiv Detail & Related papers (2025-12-05T04:41:57Z) - GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features.<n>A residual learning module is proposed to recover the fine-grained details to minimize the reconstruction error caused by quantization.<n>Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z) - Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models [2.984679075401059]
This paper presents the Multi-Modal Explainable Learning framework, designed to enhance the interpretability of vision-language models.<n>Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities.<n>We show that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations.
arXiv Detail & Related papers (2025-09-17T18:18:59Z) - Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling for ISR framework with the form of next-scale prediction.
We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z) - Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching [0.8611782340880084]
This study proposes an innovative visual semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE)<n>This model introduces a multi-head self-attention mechanism based on the consensus-aware visual semantic embedding model (CVSE) to capture information in multiple subspaces in parallel.<n>In terms of loss function design, the MH-CVSE model adopts a dynamic weight adjustment strategy to dynamically adjust the weight according to the loss value itself.
arXiv Detail & Related papers (2024-12-26T11:46:22Z) - Hierarchical Information Flow for Generalized Efficient Image Restoration [108.83750852785582]
We propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR.
Hi-IR constructs a hierarchical information tree representing the degraded image across three levels.
In seven common image restoration tasks, Hi-IR achieves its effectiveness and generalizability.
arXiv Detail & Related papers (2024-11-27T18:30:08Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation.<n>We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations.<n>We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution [6.546896650921257]
We propose HiTSR, a hierarchical transformer model for reference-based image super-resolution.
We streamline the architecture and training pipeline by incorporating the double attention block from GAN literature.
Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109.
arXiv Detail & Related papers (2024-08-30T01:16:29Z) - INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception.
We introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective.
Second, we introduce Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image (DIS) has recently emerged towards high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by it, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet)
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.