FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention
- URL: http://arxiv.org/abs/2511.12215v1
- Date: Sat, 15 Nov 2025 13:37:21 GMT
- Title: FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention
- Authors: Peng Zhang, Zhihui Lai, Wenting Chen, Xu Wu, Heng Kong
- Abstract summary: Existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and by insufficient fine-grained cross-modal alignment. FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation.
- Score: 19.49398094732301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.
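To make the abstract's three components concrete, below is a minimal PyTorch-style sketch of how semantic-aware positive pair mining, a hard-negative reweighted contrastive loss, and text-conditioned sparse attention pooling could fit together. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the row-wise min-max scaling standing in for "adaptive normalization", the top-k realization of sparsity, and all hyper-parameters (threshold, tau, beta, top_k).

```python
# Hypothetical sketch of FaNe-style objectives; names and hyper-parameters
# are assumptions, not the authors' released code.
import torch
import torch.nn.functional as F


def mine_positive_mask(text_emb: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Semantic-aware positive pair mining: reports whose embeddings are
    highly similar are treated as extra positives rather than negatives."""
    sim = F.cosine_similarity(text_emb.unsqueeze(1), text_emb.unsqueeze(0), dim=-1)
    # Row-wise min-max scaling stands in for the paper's "adaptive
    # normalization" (assumption).
    lo = sim.min(dim=1, keepdim=True).values
    hi = sim.max(dim=1, keepdim=True).values
    sim = (sim - lo) / (hi - lo + 1e-8)
    pos_mask = sim > threshold
    pos_mask.fill_diagonal_(True)  # the paired report is always a positive
    return pos_mask


def fane_contrastive_loss(img_emb, text_emb, tau=0.07, beta=1.0, threshold=0.9):
    """Image-to-text InfoNCE with mined positives removed from the negative
    set, and the remaining negatives reweighted so that semantically
    similar (hard) negatives contribute more."""
    img_emb = F.normalize(img_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = img_emb @ text_emb.t() / tau              # (B, B) similarities
    pos_mask = mine_positive_mask(text_emb, threshold)
    neg_mask = ~pos_mask

    # Hard-negative weights: exp(beta * s_ij) over negatives, rescaled so
    # the total negative mass matches the unweighted case.
    with torch.no_grad():
        w = torch.where(neg_mask, (beta * logits).exp(), torch.zeros_like(logits))
        w = w * neg_mask.sum(1, keepdim=True) / (w.sum(1, keepdim=True) + 1e-8)

    exp_logits = logits.exp()
    denom = (exp_logits * pos_mask).sum(1) + (exp_logits * w).sum(1)
    log_prob = logits - denom.log().unsqueeze(1)
    # Multi-positive loss: average over all mined positives per anchor.
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1)
    return loss.mean()  # symmetric text-to-image term omitted for brevity


def sparse_attention_pool(patch_tokens, text_query, top_k=16):
    """Text-conditioned sparse attention pooling: the report embedding
    attends over patch tokens, but only the top-k most relevant patches
    receive attention mass (top-k sparsity is an assumed mechanism)."""
    d = patch_tokens.shape[-1]                         # patch_tokens: (B, N, D)
    scores = torch.einsum("bnd,bd->bn", patch_tokens, text_query) / d ** 0.5
    top = scores.topk(top_k, dim=1)
    sparse = torch.full_like(scores, float("-inf"))
    sparse.scatter_(1, top.indices, top.values)
    attn = sparse.softmax(dim=1)                       # zero outside top-k
    return torch.einsum("bn,bnd->bd", attn, patch_tokens)  # localized (B, D)
```

In use, `fane_contrastive_loss(img, txt)` over global embeddings and an alignment term over `sparse_attention_pool` outputs would be combined into the full objective; the abstract does not specify the relative weighting of the terms, so that is left open here.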
Related papers
- Representation Learning with Semantic-aware Instance and Sparse Token Alignments [2.1008762019705434]
We propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA). We improve conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives. Our framework achieves significant improvements on fine-grained tasks even with limited labeled data.
arXiv Detail & Related papers (2026-01-13T02:55:48Z) - TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation [9.906359339999039]
We introduce a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. We propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces. Our approach outperforms previous methods, setting a new state of the art on classical (DomainNet) and complex (GeoNet) domain shifts.
arXiv Detail & Related papers (2025-08-08T16:51:44Z) - Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments [7.9714765680840625]
We propose a language-guided segmentation network with Target-informed Multi-level Contrastive Alignments (TMCA). TMCA enables target-informed cross-modality alignments and fine-grained text guidance to bridge the pattern gaps in language-guided segmentation.
arXiv Detail & Related papers (2024-12-18T06:19:03Z) - Robust image representations with counterfactual contrastive learning [17.273155534515393]
We introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis. Our method, evaluated across five datasets, outperforms standard contrastive learning in terms of robustness to acquisition shift. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning reducing subgroup disparities across biological sex.
arXiv Detail & Related papers (2024-09-16T15:11:00Z) - OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z) - RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z) - Weakly Supervised Vision-and-Language Pre-training with Relative Representations [76.63610760577214]
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
arXiv Detail & Related papers (2023-05-24T18:10:24Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Marginal Contrastive Correspondence for Guided Image Generation [58.0605433671196]
Exemplar-based image translation establishes dense correspondences between a conditional input and an exemplar from two different domains.
Existing work builds the cross-domain correspondences implicitly by minimizing feature-wise distances across the two domains.
We design a Marginal Contrastive Learning Network (MCL-Net) that explores contrastive learning to learn domain-invariant features for realistic exemplar-based image translation.
arXiv Detail & Related papers (2022-04-01T13:55:44Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Content-Preserving Unpaired Translation from Simulated to Realistic Ultrasound Images [12.136874314973689]
We introduce a novel image translation framework to bridge the appearance gap between simulated images and real scans.
We achieve this goal by leveraging both simulated images with semantic segmentations and unpaired in-vivo ultrasound scans.
arXiv Detail & Related papers (2021-03-09T22:35:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.