Related papers: Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

URL: http://arxiv.org/abs/2511.07710v2
Date: Wed, 19 Nov 2025 08:39:44 GMT
Title: Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
Authors: Jiale Liu, Haoming Zhou, Yishu Zhu, Bingzhi Chen, Yuncheng Jiang
Abstract summary: Fine-grained image-text alignment is a pivotal challenge in multimodal learning. We propose a unified approach that incorporates significance-aware and region-level uncertainty modeling. Our approach achieves state-of-the-art performance across various backbone architectures.
Score: 17.78769812974246
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
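The abstract's idea of representing a region as a mixture of Gaussians can be illustrated with a small sketch. This is not the authors' implementation: the function name `expected_sq_distance`, the diagonal-covariance assumption, and the use of expected squared distance as a region-word score are all assumptions made for illustration.

```python
import numpy as np

def expected_sq_distance(weights, means, variances, word):
    """Expected squared Euclidean distance between a word embedding and a
    region modeled as a mixture of diagonal Gaussians (hypothetical helper).

    weights:   (K,)   mixture weights, non-negative, summing to 1
    means:     (K, D) component means
    variances: (K, D) per-dimension component variances
    word:      (D,)   word embedding
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    word = np.asarray(word, dtype=float)
    # For x ~ N(mu_k, diag(var_k)): E||x - w||^2 = ||mu_k - w||^2 + sum(var_k)
    per_component = ((means - word) ** 2).sum(axis=1) + variances.sum(axis=1)
    # Average the per-component expectations under the mixture weights.
    return float(weights @ per_component)

# Toy region with two components: one uncertain and centered on the word,
# one certain but offset from it.
score = expected_sq_distance(
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [2.0, 0.0]],
    variances=[[1.0, 1.0], [0.0, 0.0]],
    word=[0.0, 0.0],
)
```

A lower score means the word lies closer to the region on average. The variance term penalizes components with high uncertainty, which is one simple way a distributional representation can behave differently from a single point embedding.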

Related papers

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions. We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals. We then introduce ReAlign, a training-free modality alignment strategy.
arXiv Detail & Related papers (2026-02-02T13:59:39Z)
Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation [8.834663340762562]
Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. Existing methods typically employ uniform fusion and refinement strategies across the entire image. We propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference.
arXiv Detail & Related papers (2026-01-07T01:02:39Z)
SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment [8.657941729790599]
We introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance.
arXiv Detail & Related papers (2025-11-03T09:41:32Z)
UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution [38.794214985205045]
We propose a novel method to generate clear and accurate regional details in super-resolution images. The method explicitly extracts localized fine-grained information and encodes it through a novel regional attention mechanism. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details.
arXiv Detail & Related papers (2025-08-22T07:28:34Z)
CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis [0.6961946145048322]
This paper introduces an end-to-end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion. The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state-of-the-art methods.
arXiv Detail & Related papers (2025-07-21T11:49:57Z)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture. It integrates implicit and explicit modeling approaches within a two-stage framework. It outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images. Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding. We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties. Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z)
MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.