HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
- URL: http://arxiv.org/abs/2511.22594v1
- Date: Thu, 27 Nov 2025 16:24:53 GMT
- Title: HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
- Authors: Haoxi Zeng, Haoxuan Li, Yi Bin, Pengpeng Zeng, Xing Xu, Yang Yang, Heng Tao Shen,
- Abstract summary: HarmoCLIP is a novel framework designed to harmonize global and region representations within Contrastive Language-Image Pre-training.<n>To strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy.
- Score: 63.87966115136411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off where improving local perception simultaneously degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art (improvement up to 69.78%) performance on the global task of retrieval and yields a substantial 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
Related papers
- LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation [12.192429756057132]
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories.<n>LoGoSeg integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context.
arXiv Detail & Related papers (2026-02-05T12:03:11Z) - SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation [19.962252347191523]
This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches.<n>Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment.<n>At the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation.
arXiv Detail & Related papers (2025-12-01T14:06:50Z) - Enhancing CLIP Robustness via Cross-Modality Alignment [54.01929554563447]
We propose Cross-modality Alignment, an optimal transport-based framework for vision-language models.<n> COLA restores global image-text alignment and local structural consistency in the feature space.<n> COLA is training-free and compatible with existing fine-tuned models.
arXiv Detail & Related papers (2025-10-28T03:47:44Z) - GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition [72.29071664964633]
We propose GLip, a Global-Local Integrated Progressive framework designed for robust visual speech recognition (VSR)<n>GLip learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data.<n>In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context.
arXiv Detail & Related papers (2025-09-19T14:36:01Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain content'' and context'' features respectively.<n>It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme [4.247494613524502]
Semi-Supervised Semantic (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data.<n>We propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT.<n>Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning.
arXiv Detail & Related papers (2025-04-14T01:51:29Z) - Refining CLIP's Spatial Awareness: A Visual-Centric Perspective [10.936397225984107]
Contrastive Language-Image Pre-training excels in global alignment with language but exhibits limited sensitivity to spatial information.<n>Recent approaches have introduced Region-Language Alignment to enhance CLIP's performance in dense multimodal tasks.<n>We propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation.
arXiv Detail & Related papers (2025-04-03T07:04:56Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - Progressive Feature Self-reinforcement for Weakly Supervised Semantic
Segmentation [55.69128107473125]
We propose a single-stage approach for Weakly Supervised Semantic (WSSS) with image-level labels.
We adaptively partition the image content into deterministic regions (e.g., confident foreground and background) and uncertain regions (e.g., object boundaries and misclassified categories) for separate processing.
Building upon this, we introduce a complementary self-enhancement method that constrains the semantic consistency between these confident regions and an augmented image with the same class labels.
arXiv Detail & Related papers (2023-12-14T13:21:52Z) - Region-Based Semantic Factorization in GANs [67.90498535507106]
We present a highly efficient algorithm to factorize the latent semantics learned by Generative Adversarial Networks (GANs) concerning an arbitrary image region.
Through an appropriately defined generalized Rayleigh quotient, we solve such a problem without any annotations or training.
Experimental results on various state-of-the-art GAN models demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-02-19T17:46:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.