Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
- URL: http://arxiv.org/abs/2510.09741v1
- Date: Fri, 10 Oct 2025 17:57:06 GMT
- Title: Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
- Authors: Dwip Dalal, Gautam Vashishtha, Utkarsh Mishra, Jeonghwan Kim, Madhav Kanda, Hyeonjeong Ha, Svetlana Lazebnik, Heng Ji, Unnat Jain
- Abstract summary: AttWarp is a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image. This attention-guided warping preserves all original image information but redistributes it non-uniformly.
- Score: 43.14520214157644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
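To make the mechanism concrete, here is a minimal sketch of attention-guided rectilinear warping under stated assumptions: the attention map is taken as given (in AttWarp it comes from the MLLM's own cross-modal attention for the query), and nearest-neighbour gathering stands in for a proper bilinear sampler. Names and details are illustrative, not the authors' reference implementation.

```python
# Minimal sketch: warp rows/columns so high-attention regions get more pixels.
import numpy as np

def rectilinear_warp(image: np.ndarray, attention: np.ndarray,
                     uniform_mix: float = 0.3) -> np.ndarray:
    """image: (H, W, C); attention: (H, W) non-negative importance map.
    uniform_mix blends in a uniform map so low-attention regions are
    compressed but never dropped entirely."""
    H, W, _ = image.shape
    att = attention.astype(np.float64)
    att /= att.sum() + 1e-12
    att = (1.0 - uniform_mix) * att + uniform_mix / (H * W)

    # Marginal attention per row/column -> strictly increasing cumulative curves.
    row_cdf = np.cumsum(att.sum(axis=1)); row_cdf /= row_cdf[-1]
    col_cdf = np.cumsum(att.sum(axis=0)); col_cdf /= col_cdf[-1]

    # Invert the curves: output pixel i samples input coordinate cdf^{-1}(i/H),
    # so regions where attention is dense occupy a wider span of the output.
    src_rows = np.interp(np.linspace(0.0, 1.0, H), row_cdf, np.arange(H))
    src_cols = np.interp(np.linspace(0.0, 1.0, W), col_cdf, np.arange(W))

    # Nearest-neighbour gather keeps the sketch dependency-free; bilinear
    # sampling would be the natural upgrade.
    rr = np.clip(np.round(src_rows).astype(int), 0, H - 1)
    cc = np.clip(np.round(src_cols).astype(int), 0, W - 1)
    return image[rr][:, cc]
```

Because the warp only remaps rows and columns monotonically, the output stays rectilinear and the global layout survives, which is what lets the same frozen MLLM read the warped image without retraining.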
Related papers
- Exploring MLLM-Diffusion Information Transfer with MetaCanvas [66.28602082523464]
We propose a lightweight framework that lets MLLMs reason and plan directly in spatial and multimodal latent spaces. We evaluate it across six visual generation tasks, including text-to-image generation, text/image-to-video generation, image/video attribute editing, and in-context video generation.
arXiv Detail & Related papers (2025-12-12T11:07:11Z)
- Spatial Preference Rewarding for MLLMs Spatial Understanding [92.25703021388142]
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
arXiv Detail & Related papers (2025-10-16T07:16:18Z)
- Multimodal LLMs as Customized Reward Models for Text-to-Image Generation [60.164968941945645]
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives. LLaVA-Reward directly utilizes the hidden states of multimodal large language models (MLLMs). We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking.
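A hypothetical sketch of the core idea, scoring generations from the MLLM's hidden states: pool the backbone's last-layer states and project to one score per perspective. Layer sizes and mean pooling are assumptions, not LLaVA-Reward's exact design.

```python
import torch
import torch.nn as nn

PERSPECTIVES = ["alignment", "fidelity", "safety", "overall"]  # from abstract

class RewardHead(nn.Module):
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, len(PERSPECTIVES)),  # one score each
        )

    def forward(self, hidden_states: torch.Tensor,
                mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, D) from the MLLM; mask: (B, T) float, 1 = valid.
        pooled = (hidden_states * mask.unsqueeze(-1)).sum(1) / \
                 mask.sum(1, keepdim=True)          # mean over valid tokens
        return self.proj(pooled)                    # (B, 4) perspective scores
```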
arXiv Detail & Related papers (2025-07-28T23:52:53Z)
- Demystifying the Visual Quality Paradox in Multimodal Large Language Models [49.154146792279946]
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks. We uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity.
arXiv Detail & Related papers (2025-06-18T17:14:07Z)
- QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining [28.2730962800806]
We propose a drop-in replacement for CLIP vision encoders that can be seamlessly integrated with existing MLLMs. QLIP improves the general visual question answering accuracy of the LLaVA v1.5 model series across various model sizes. Notably, QLIP boosts detailed understanding performance on the challenging $V^*$ benchmark by up to 13.6 percent.
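An illustrative sketch of a dynamic quadtree prior: recursively subdivide regions with high visual detail so they receive finer patches. The variance criterion, thresholds, and square-input assumption below are ours; QLIP's actual splitting rule may differ.

```python
import numpy as np

def quadtree_cells(gray: np.ndarray, x=0, y=0, size=None,
                   var_thresh=200.0, min_size=32):
    """Yield (x, y, size) cells; visually busy cells split into four children."""
    if size is None:
        size = gray.shape[0]                 # assumes a square grayscale input
    patch = gray[y:y + size, x:x + size]
    if size > min_size and patch.var() > var_thresh:
        half = size // 2
        for dx, dy in [(0, 0), (half, 0), (0, half), (half, half)]:
            yield from quadtree_cells(gray, x + dx, y + dy, half,
                                      var_thresh, min_size)
    else:
        yield (x, y, size)
```

Each leaf cell can then be encoded at a resolution matched to its size, so detailed regions contribute more visual tokens than flat ones.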
arXiv Detail & Related papers (2025-05-29T02:26:34Z)
- Can Multimodal Large Language Models Understand Spatial Relations? [16.76001474065412]
We introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO 2017. Results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.
arXiv Detail & Related papers (2025-05-25T07:37:34Z)
- Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration [22.39558434131574]
Large Vision-Language Models (LVLMs) often generate responses that are not factually aligned with the visual content. We introduce a training-free solution, Uniform Attention Calibration (UAC), that estimates the bias from a single meaningless input image. We also introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), that enforces consistent outputs wherever the object is located in the image.
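A loose sketch of the "meaningless image" calibration idea: attention measured on a blank input approximates a content-independent bias, which is then subtracted from attention on real images. The forward interface (`output_attentions=True`, `.attentions`) mirrors common HF-style APIs but is an assumption here, as is the whole function.

```python
import torch

@torch.no_grad()
def calibrated_attention(model, image, blank, text_ids):
    """Subtract the attention bias estimated from a single blank image."""
    att_real = model(image, text_ids, output_attentions=True).attentions[-1]
    att_bias = model(blank, text_ids, output_attentions=True).attentions[-1]
    att = (att_real - att_bias).clamp_min(0)    # remove content-free bias
    return att / att.sum(dim=-1, keepdim=True)  # renormalise attention rows
```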
arXiv Detail & Related papers (2025-02-04T03:27:38Z)
- Modality-Fair Preference Optimization for Trustworthy MLLM Alignment [22.093944381988496]
Multimodal large language models (MLLMs) have achieved remarkable success across various tasks. However, separate training of visual and textual encoders often results in a misalignment between the modalities. The resulting inaccuracies severely undermine the trustworthiness of MLLMs in real-world applications.
arXiv Detail & Related papers (2024-10-20T08:56:52Z)
- Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach consistently brings substantial gains to GPT-4V/O across four benchmarks.
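A sketch of the visual-prompting step implied by the summary: once a lightweight tracker links the same object across frames, each instance is overlaid with a shared ID so the MLLM can match them. The tracker output format below is an assumption.

```python
from PIL import Image, ImageDraw

def mark_correspondences(frames, tracks):
    """frames: list[PIL.Image]; tracks: {obj_id: {frame_idx: (x0, y0, x1, y1)}}."""
    marked = [f.copy() for f in frames]
    for obj_id, boxes in tracks.items():
        for idx, (x0, y0, x1, y1) in boxes.items():
            draw = ImageDraw.Draw(marked[idx])
            draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
            draw.text((x0 + 4, y0 + 4), str(obj_id), fill="red")
    return marked  # feed these marked frames to the MLLM instead of raw ones
```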
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models [44.437693135170576]
We propose a new framework: LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME). We extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks. The proposed method achieves leading performance across various benchmarks with only 2 million training samples.
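A rough sketch of a "mixture of adapters" over the global view: several small bottleneck adapters specialise in different tasks, and a learned gate blends their outputs per token. Dimensions and the gating scheme are assumptions, not SliME's exact design.

```python
import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    def __init__(self, dim=1024, n_adapters=4, bottleneck=128):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_adapters))
        self.gate = nn.Linear(dim, n_adapters)

    def forward(self, x):                        # x: (B, T, D) global tokens
        w = self.gate(x).softmax(-1)             # (B, T, n_adapters) weights
        outs = torch.stack([a(x) for a in self.adapters], dim=-1)  # (B,T,D,n)
        return x + (outs * w.unsqueeze(-2)).sum(-1)  # residual weighted blend
```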
arXiv Detail & Related papers (2024-06-12T17:59:49Z)
- Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
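A conceptual sketch of guidance at decoding time: next-token logits conditioned on extra object-grounding evidence are blended with the plain logits, steering generation toward objects actually present. The interface and the classifier-free-guidance-style formula are assumptions in the spirit of MARINE, not its exact method.

```python
import torch

def guided_logits(logits_plain: torch.Tensor,
                  logits_grounded: torch.Tensor,
                  gamma: float = 1.5) -> torch.Tensor:
    """Blend two next-token logit tensors; gamma > 1 extrapolates toward
    the grounded distribution, suppressing ungrounded object tokens."""
    return logits_plain + gamma * (logits_grounded - logits_plain)
```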
arXiv Detail & Related papers (2024-02-13T18:59:05Z)