When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
- URL: http://arxiv.org/abs/2503.07588v2
- Date: Tue, 25 Mar 2025 15:05:34 GMT
- Title: When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
- Authors: Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
- Abstract summary: Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images. We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method outperforms existing high-resolution strategies on four datasets using the same data.
- Score: 31.696397337675847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in https://github.com/VisionXLab/LRS-VQA.
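The abstract does not spell out the Region Focus Module's internals; as a rough illustration only, here is a minimal sketch of text-guided token pruning in which vision tokens are scored against a pooled text query and only the top-scoring ones are kept (all names, shapes, and the scoring rule are assumptions, not the paper's method):

```python
# Minimal sketch of text-guided vision-token pruning (hypothetical shapes and
# names; the paper's actual Region Focus Module is more involved).
import torch

def prune_vision_tokens(vision_tokens, text_tokens, keep_ratio=0.25):
    """Keep the vision tokens most relevant to the text query.

    vision_tokens: (N_v, D) image patch embeddings
    text_tokens:   (N_t, D) text token embeddings
    """
    # Cross-modal relevance: cosine similarity of each vision token
    # to the mean-pooled text embedding.
    text_query = text_tokens.mean(dim=0, keepdim=True)           # (1, D)
    scores = torch.cosine_similarity(vision_tokens, text_query)  # (N_v,)

    k = max(1, int(keep_ratio * vision_tokens.size(0)))
    keep_idx = scores.topk(k).indices.sort().values  # preserve spatial order
    return vision_tokens[keep_idx], keep_idx

# Example: 1024 patch tokens, keep the 256 most text-relevant ones.
v = torch.randn(1024, 768)
t = torch.randn(12, 768)
pruned, idx = prune_vision_tokens(v, t, keep_ratio=0.25)
print(pruned.shape)  # torch.Size([256, 768])
```

Per the abstract, such scoring is further coupled with the DIP's coarse-to-fine tile selection, so finer-resolution tiles are only fetched where the coarser pass found text-relevant regions.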
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformers.
Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.
On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
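As a loose illustration of the token-shuffle idea (folding each s×s window of visual tokens into the channel axis so the Transformer processes fewer, wider tokens, then unfolding afterwards), here is a hypothetical sketch; the operation names and shapes are assumptions:

```python
# Hypothetical sketch of token shuffle/unshuffle on a token grid.
import torch

def token_shuffle(x, h, w, s=2):
    """(B, h*w, D) -> (B, h*w/s^2, D*s^2) by folding s x s token windows."""
    B, N, D = x.shape
    x = x.view(B, h, w, D)
    x = x.view(B, h // s, s, w // s, s, D).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, (h // s) * (w // s), D * s * s)

def token_unshuffle(x, h, w, s=2):
    """Exact inverse of token_shuffle."""
    B, N, Ds2 = x.shape
    D = Ds2 // (s * s)
    x = x.view(B, h // s, w // s, s, s, D).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, h * w, D)

x = torch.randn(2, 32 * 32, 256)
y = token_shuffle(x, 32, 32)  # (2, 256, 1024): 4x fewer tokens
assert torch.equal(token_unshuffle(y, 32, 32), x)
```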
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution [42.76046559103463]
Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). These methods restrict attention to content-agnostic local regions, directly limiting attention's ability to capture long-range dependencies. We propose a lightweight Content-Aware Token Aggregation Network (CATANet) to address these issues. Our method achieves superior performance, with a maximum PSNR improvement of 0.33 dB and nearly double the inference speed.
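A very loose sketch of the aggregation idea (not CATANet's actual architecture): group tokens by content similarity and let every token attend to the compact group summaries, so long-range dependencies are captured without full global attention. The centroid initialization below is a placeholder assumption:

```python
# Loose sketch of content-aware token aggregation; all design choices here
# are illustrative assumptions, not CATANet's design.
import torch
import torch.nn.functional as F

def aggregate_and_attend(tokens, num_groups=8):
    """tokens: (N, D). Returns (N, D) tokens enriched with global context."""
    N, D = tokens.shape
    # Crude content-based grouping: chunk means as centroids (N divisible by G).
    centroids = tokens.view(num_groups, N // num_groups, D).mean(dim=1)   # (G, D)
    assign = F.softmax(tokens @ centroids.T / D ** 0.5, dim=-1)           # (N, G)
    groups = assign.T @ tokens / (assign.sum(dim=0).unsqueeze(1) + 1e-6)  # (G, D)

    # Cross-attention from every token to the G aggregated summaries.
    attn = F.softmax(tokens @ groups.T / D ** 0.5, dim=-1)                # (N, G)
    return tokens + attn @ groups                                         # (N, D)

out = aggregate_and_attend(torch.randn(4096, 64))
print(out.shape)  # torch.Size([4096, 64])
```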
arXiv Detail & Related papers (2025-03-10T04:00:27Z)
- GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing [32.85223015863783]
GeoPixel is an end-to-end high-resolution RS-LMM that supports pixel-level grounding. It supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks.
arXiv Detail & Related papers (2025-01-23T18:59:30Z)
- Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG [24.342190878813234]
ImageRAG for RS is a training-free framework that addresses the complexities of analyzing ultra-high-resolution (UHR) remote sensing imagery.
ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts.
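A minimal sketch of this retrieve-then-focus pattern, assuming tile and text embeddings already live in a shared space (e.g., from a CLIP-style encoder, which is an assumption here):

```python
# Hedged sketch: rank UHR image tiles by similarity to the text query and
# keep only the top-k as visual context for the LVLM.
import torch
import torch.nn.functional as F

def retrieve_relevant_tiles(tile_embs, text_emb, k=4):
    """tile_embs: (T, D) one embedding per image tile; text_emb: (D,)."""
    tile_embs = F.normalize(tile_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = tile_embs @ text_emb   # (T,) cosine similarities
    return sims.topk(k).indices   # indices of tiles to keep as context

# Example: a 20x20 tiling of a UHR scene, keep the 4 best-matching tiles.
idx = retrieve_relevant_tiles(torch.randn(400, 512), torch.randn(512))
print(idx.tolist())
```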
arXiv Detail & Related papers (2024-11-12T10:12:12Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008\times 1008$.
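As an illustration of dynamic partitioning (not AdaptVision's actual rule), a toy policy that derives a tile grid from the input's size and aspect ratio under a fixed tile budget might look like this; the tile size and budget values are made up:

```python
# Toy dynamic-partitioning policy; tile=336 and max_tiles=9 are assumptions.
import math

def dynamic_grid(width, height, tile=336, max_tiles=9):
    """Return (cols, rows) so tiles roughly cover the image within a budget."""
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    # Shrink proportionally if the grid exceeds the tile budget.
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

print(dynamic_grid(1008, 1008))  # (3, 3): a 1008x1008 input maps to 9 tiles
print(dynamic_grid(2000, 500))   # wide input gets a wide grid
```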
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
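The mechanism as described lends itself to a compact sketch: pool features to a fixed low-resolution grid, run self-attention there, and fuse the result back at full resolution. The shapes and the fusion-by-addition choice below are assumptions:

```python
# Minimal sketch of low-resolution self-attention; attention cost depends on
# grid**2, not on the input's H*W.
import torch
import torch.nn.functional as F

def low_res_self_attention(x, grid=16):
    """x: (B, C, H, W) -> (B, C, H, W) with globally-attended context added."""
    B, C, H, W = x.shape
    low = F.adaptive_avg_pool2d(x, (grid, grid))   # (B, C, g, g)
    tokens = low.flatten(2).transpose(1, 2)        # (B, g*g, C)
    attn = F.softmax(tokens @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)
    tokens = attn @ tokens                         # (B, g*g, C)
    low = tokens.transpose(1, 2).view(B, C, grid, grid)
    # Fuse the low-resolution global context back at full resolution.
    return x + F.interpolate(low, size=(H, W), mode="bilinear",
                             align_corners=False)

y = low_res_self_attention(torch.randn(1, 64, 512, 512))
print(y.shape)  # torch.Size([1, 64, 512, 512])
```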
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
- Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution [22.60056946339325]
We propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling.
The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features.
We demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy.
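A rough sketch of pixel-neighborhood graph attention in this spirit (the actual PAM differs, e.g., in learned projections and update rules): each pixel attends over its 3x3 neighborhood and aggregates their features.

```python
# Illustrative neighborhood attention: every pixel is a graph node connected
# to its 3x3 neighbors; weights come from center-neighbor similarity.
import torch
import torch.nn.functional as F

def neighborhood_attention(x):
    """x: (B, C, H, W) -> (B, C, H, W), each pixel updated from its neighbors."""
    B, C, H, W = x.shape
    # Gather the 3x3 neighborhood of every pixel: (B, C*9, H*W).
    nbrs = F.unfold(x, kernel_size=3, padding=1)
    nbrs = nbrs.view(B, C, 9, H * W)                               # (B, C, 9, HW)
    center = x.view(B, C, 1, H * W)                                # (B, C, 1, HW)
    # Per-pixel attention weights over the 9 neighbors.
    logits = (center * nbrs).sum(dim=1, keepdim=True) / C ** 0.5  # (B, 1, 9, HW)
    w = F.softmax(logits, dim=2)
    out = (w * nbrs).sum(dim=2)                                    # (B, C, HW)
    return out.view(B, C, H, W)

print(neighborhood_attention(torch.randn(1, 32, 64, 64)).shape)
```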
arXiv Detail & Related papers (2023-09-16T08:12:12Z)
- μSplit: efficient image decomposition for microscopy data [50.794670705085835]
μSplit is a dedicated approach for trained image decomposition in the context of fluorescence microscopy images.
We introduce lateral contextualization (LC), a novel meta-architecture that enables the memory efficient incorporation of large image-context.
We apply μSplit to five decomposition tasks: one on a synthetic dataset and four derived from real microscopy data.
arXiv Detail & Related papers (2022-11-23T11:26:24Z) - RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing
Data [14.742224345061487]
We introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language.
In this work, we construct a large-scale benchmark dataset of RSVG and explore deep learning models for the RSVG task.
arXiv Detail & Related papers (2022-10-23T07:08:22Z) - Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise, high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
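A bare-bones sketch of the multi-scale idea (far simpler than the paper's actual blocks): keep the full-resolution stream and add back upsampled context from coarser copies. Scale factors and the additive fusion are assumptions:

```python
# Illustrative multi-scale feature fusion: coarse context is folded back into
# the full-resolution stream without ever discarding spatial detail.
import torch
import torch.nn.functional as F

def fuse_multiscale(feat, scales=(2, 4)):
    """feat: (B, C, H, W). Adds upsampled context from coarser copies."""
    B, C, H, W = feat.shape
    fused = feat
    for s in scales:
        low = F.avg_pool2d(feat, kernel_size=s)  # coarse contextual stream
        fused = fused + F.interpolate(low, size=(H, W), mode="bilinear",
                                      align_corners=False)
    return fused

print(fuse_multiscale(torch.randn(1, 48, 128, 128)).shape)
```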
arXiv Detail & Related papers (2022-04-19T17:59:45Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
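The described dual-encoder objective is a standard symmetric contrastive (InfoNCE-style) loss; a minimal sketch follows, with the temperature value assumed:

```python
# Symmetric contrastive loss between matched image/text embedding pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) matched image/text pairs along the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature  # (B, B) pairwise similarities
    labels = torch.arange(img.size(0))  # positives on the diagonal
    # Average of image-to-text and text-to-image cross-entropies.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```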
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.