HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs
- URL: http://arxiv.org/abs/2506.17608v1
- Date: Sat, 21 Jun 2025 06:13:56 GMT
- Title: HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs
- Authors: Nikitha SR, Aradhya Neeraj Mathur, Tarun Ram Menta, Rishabh Jain, Mausoom Sarkar
- Abstract summary: We develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. We demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost.
- Score: 5.362066717455192
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x savings in FLOPs.
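As a concrete illustration of the baseline idea the abstract alludes to, the sketch below bilinearly upsamples a ViT patch-feature grid; a shallow learned enricher would then refine such a map instead of running the encoder again on high-resolution crops. This is a hypothetical sketch of the general approach, not the paper's released implementation.

```python
import numpy as np

def bilinear_upsample(feat, scale=2):
    """Bilinearly upsample an (H, W, C) feature map by an integer factor."""
    H, W, C = feat.shape
    out_h, out_w = H * scale, W * scale
    # Source coordinates for each output pixel (align_corners=False style).
    ys = (np.arange(out_h) + 0.5) / scale - 0.5
    xs = (np.arange(out_w) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bot = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

feat = np.random.default_rng(0).normal(size=(16, 16, 64))  # e.g. a 16x16 ViT patch grid
hires = bilinear_upsample(feat, scale=2)
print(hires.shape)  # (32, 32, 64)
```

A single cheap interpolation like this, followed by a few learned layers, is the kind of path that avoids the multiple encoder calls the abstract identifies as the cost driver.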
Related papers
- HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation [74.1872891313184]
HRSeg is an efficient model with high-resolution fine-grained perception.
It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE).
arXiv Detail & Related papers (2025-07-17T08:09:31Z)
- JAFAR: Jack up Any Feature at Any Resolution [53.343826346140624]
JAFAR is a lightweight and flexible feature upsampler for Foundation Vision encoders.
It enhances the spatial resolution of visual features from any Foundation Vision encoder to an arbitrary target resolution.
It generalizes remarkably well to significantly higher output scales.
arXiv Detail & Related papers (2025-06-10T20:53:12Z)
- Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling.
We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution.
Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer.
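The linear-attention variant that RELA builds on can be sketched in a few lines: the softmax kernel is replaced with a positive feature map so keys and values can be aggregated once and shared across queries, dropping the cost from O(N^2 d) to O(N d^2). The feature map below and the omission of RELA's depthwise-convolution rank enhancement are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention: softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V)."""
    phi = lambda x: np.maximum(x, 0) + eps   # simple positive feature map (assumption)
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                            # (d, d_v), computed once, shared by all queries
    z = Qp @ Kp.sum(axis=0)                  # (N,) per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because the weights are positive and normalized, each output row is a convex combination of value rows, just as in softmax attention, but without ever materializing the N x N attention matrix.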
arXiv Detail & Related papers (2025-05-22T02:57:23Z)
- FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression [45.37530855889661]
High-resolution images lead to a quadratic increase in the number of visual tokens input into Multi-modal Large Language Models.
Current works develop visual token compression methods to improve efficiency, often at the expense of performance.
We build a coarse-to-fine visual token compression method, with a vision-guided sampler that compresses redundant regions with low information density and a text-guided sampler that selects visual tokens strongly correlated with the user's instructions.
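The coarse-to-fine pipeline can be sketched as two successive selection stages. The scoring functions below (feature variance as an information-density proxy, cosine similarity to a mean-pooled instruction embedding) are hypothetical stand-ins for FocusLLaVA's learned samplers, chosen only to make the two-stage structure concrete.

```python
import numpy as np

def compress_tokens(vis, txt, keep_coarse=0.5, keep_fine=8):
    """Two-stage token selection sketch (hypothetical scoring functions)."""
    # Stage 1 (vision-guided): drop tokens with low per-token variance,
    # a crude proxy for low information density.
    var = vis.var(axis=1)
    n1 = max(1, int(len(vis) * keep_coarse))
    coarse_idx = np.argsort(-var)[:n1]
    coarse = vis[coarse_idx]
    # Stage 2 (text-guided): keep tokens most similar to the pooled instruction.
    t = txt.mean(axis=0)
    sim = (coarse @ t) / (np.linalg.norm(coarse, axis=1) * np.linalg.norm(t) + 1e-8)
    fine_idx = np.argsort(-sim)[:keep_fine]
    return coarse_idx[fine_idx]              # indices into the original token set

rng = np.random.default_rng(0)
vis = rng.normal(size=(64, 32))   # 64 visual tokens
txt = rng.normal(size=(5, 32))    # 5 instruction-token embeddings
kept = compress_tokens(vis, txt)
print(len(kept))  # 8
```

The point of the two-stage design is that the cheap vision-guided pass shrinks the candidate set before the instruction-conditioned pass runs, so text relevance is only scored on plausibly informative tokens.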
arXiv Detail & Related papers (2024-11-21T15:37:52Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
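The low-rank idea behind this line of work can be illustrated with a plain truncated SVD: an N x N token-mixing matrix is replaced by a tall-times-wide factor pair, cutting mixing cost from O(N^2) to O(N r) per channel. LinFusion's actual mixers are distilled and learned, not obtained by SVD; this is only a minimal sketch of why low rank buys efficiency.

```python
import numpy as np

def low_rank_mixer(M, rank):
    """Truncated-SVD factorization M ~ A @ B with A (N, r), B (r, N)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

rng = np.random.default_rng(0)
N = 32
# A smooth (locally decaying) mixing matrix is well approximated at low rank.
M = np.exp(-np.abs(np.arange(N)[:, None] - np.arange(N)[None, :]) / 4.0)
A, B = low_rank_mixer(M, rank=8)
A4, B4 = low_rank_mixer(M, rank=4)
err8 = np.linalg.norm(M - A @ B)
err4 = np.linalg.norm(M - A4 @ B4)
print(err8 < err4)  # True: more rank, better reconstruction
```

Applying A then B costs 2Nr multiplies per channel instead of N^2, which is what makes very large token counts (high resolutions) tractable.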
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks [36.61645124563195]
We present UltraPixel, a novel architecture utilizing cascade diffusion models to generate high-quality images at multiple resolutions.
We use semantics-rich representations of lower-resolution images in the later denoising stage to guide the generation of highly detailed high-resolution images.
Our model achieves fast training with reduced data requirements, producing photo-realistic high-resolution images.
arXiv Detail & Related papers (2024-07-02T11:02:19Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
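The dual parallel/recurrent property comes from the generic retention mechanism (RetNet-style), which ViR adapts to vision: a decay-masked attention matrix for training and an O(1)-state recurrence for inference compute the same outputs. The sketch below demonstrates that equivalence; it is a generic retention illustration, not ViR's full architecture.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: decay-masked attention, suited to training."""
    N = Q.shape[0]
    idx = np.arange(N)
    # D[t, s] = gamma^(t-s) for s <= t, else 0 (causal decay mask).
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: one (d, d_v) state, suited to fast inference."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)   # S_t = gamma * S_{t-1} + k_t v_t^T
        out.append(q @ S)                # o_t = q_t S_t
    return np.stack(out)

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 6, 4))
p = retention_parallel(Q, K, V, gamma=0.9)
r = retention_recurrent(Q, K, V, gamma=0.9)
print(np.allclose(p, r))  # True
```

Both forms evaluate o_t = sum over s <= t of gamma^(t-s) (q_t . k_s) v_s, which is the "optimal balance" the summary refers to: parallel for throughput in training, recurrent for constant memory at inference.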
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Wider and Higher: Intensive Integration and Global Foreground Perception for Image Matting [44.51635913732913]
This paper reviews recent deep-learning-based matting research and conceives our wider and higher motivation for image matting.
Image matting is essentially a pixel-wise regression, and the ideal situation is to perceive the maximum opacity from the input image.
We propose an Intensive Integration and Global Foreground Perception network (I2GFP) to integrate wider and higher feature streams.
arXiv Detail & Related papers (2022-10-13T11:34:46Z)
- Hybrid Pixel-Unshuffled Network for Lightweight Image Super-Resolution [64.54162195322246]
Convolutional neural networks (CNNs) have achieved great success on image super-resolution (SR).
Most deep CNN-based SR models take massive computations to obtain high performance.
We propose a novel Hybrid Pixel-Unshuffled Network (HPUN) by introducing an efficient and effective downsampling module into the SR task.
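The pixel-unshuffle operator at the heart of this design trades spatial resolution for channels losslessly, so downstream convolutions run on a smaller grid. The sketch below shows the generic rearrangement only; HPUN's full downsampling module combines it with convolutions not reproduced here.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Rearrange (H, W, C) into (H/r, W/r, C*r*r) with no information loss."""
    H, W, C = x.shape
    assert H % r == 0 and W % r == 0
    x = x.reshape(H // r, r, W // r, r, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // r, W // r, C * r * r)

x = np.arange(16.0).reshape(4, 4, 1)
y = pixel_unshuffle(x, 2)
print(y.shape)  # (2, 2, 4)
```

Each output location stacks the r x r block of input pixels it covers into channels, so the operation is exactly invertible (pixel shuffle), unlike strided pooling.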
arXiv Detail & Related papers (2022-03-16T20:10:41Z)
- Exploring Multi-Scale Feature Propagation and Communication for Image Super Resolution [37.91175933401261]
We present a unified formulation over widely-used multi-scale structures.
We propose a generic and efficient multi-scale convolution unit -- Multi-Scale cross-Scale Share-weights convolution (MS$^3$-Conv).
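One reading of cross-scale weight sharing is sketched below: a single kernel is applied both to the full-resolution map and to a downsampled copy, and the branch outputs are fused, so multi-scale context costs no extra parameters. This is a hypothetical illustration of the idea, not the MS$^3$-Conv unit itself.

```python
import numpy as np

def conv2d(x, k):
    """'Same'-padded single-channel 2-D convolution (cross-correlation)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def ms3_conv(x, k, scales=(1, 2)):
    """Apply ONE shared kernel at several scales and sum the results."""
    out = conv2d(x, k)
    for s in scales[1:]:
        small = x[::s, ::s]                                    # naive downsample
        ys = conv2d(small, k)                                  # same weights, coarser grid
        out += np.repeat(np.repeat(ys, s, axis=0), s, axis=1)  # nearest upsample
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k = np.ones((3, 3)) / 9.0   # one shared smoothing kernel
y = ms3_conv(x, k)
print(y.shape)  # (8, 8)
```

Compared with learning a separate kernel per scale, sharing weights keeps the parameter count constant while the coarse branch still sees an effectively larger receptive field.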
arXiv Detail & Related papers (2020-08-01T10:44:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.