CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
- URL: http://arxiv.org/abs/2508.06434v2
- Date: Thu, 25 Sep 2025 12:29:48 GMT
- Title: CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
- Authors: Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li
- Abstract summary: Large-scale natural image-text datasets often suffer from loose semantic alignment due to weak supervision. We propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures. Two shared pre-projectors are designed for the image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning.
- Score: 28.2773807732662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
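The abstract describes the core mechanism: a standard CLIP contrastive objective combined with a non-contrastive branch, with shared pre-projectors sitting on the path used by both objectives. Below is a minimal PyTorch-style sketch of how such a plug-in could be wired. The names (PreProjector, clipin_loss, predictor), the BYOL/SimSiam-style cosine-prediction objective, and the loss weighting are illustrative assumptions, not the released implementation at https://github.com/T6Yang/CLIPin.

```python
# Hedged sketch: wiring a non-contrastive branch into a CLIP-style contrastive loss
# through shared pre-projectors. Names and the specific non-contrastive objective
# are assumptions for illustration; see https://github.com/T6Yang/CLIPin for the
# authors' actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreProjector(nn.Module):
    """Pre-projector placed after an encoder; its output feeds both the
    contrastive and the non-contrastive objectives, so the two share parameters."""

    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def clipin_loss(img_feat, txt_feat, img_pre, txt_pre, predictor,
                temperature: float = 0.07, alpha: float = 0.5) -> torch.Tensor:
    """CLIP's symmetric InfoNCE plus a non-contrastive cosine-prediction term."""
    # Shared pre-projection, then L2 normalization.
    z_i = F.normalize(img_pre(img_feat), dim=-1)
    z_t = F.normalize(txt_pre(txt_feat), dim=-1)

    # Contrastive branch: in-batch image-text InfoNCE (standard CLIP loss).
    logits = z_i @ z_t.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Non-contrastive branch: predict one modality from the other without negatives,
    # with a stop-gradient on the target side (BYOL/SimSiam-style).
    p_i = F.normalize(predictor(z_i), dim=-1)
    p_t = F.normalize(predictor(z_t), dim=-1)
    loss_non = -(p_i * z_t.detach()).sum(-1).mean() - (p_t * z_i.detach()).sum(-1).mean()

    return loss_con + alpha * loss_non


# Usage sketch with random features standing in for encoder outputs.
dim = 512
img_pre, txt_pre = PreProjector(dim), PreProjector(dim)
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
img_feat, txt_feat = torch.randn(8, dim), torch.randn(8, dim)
loss = clipin_loss(img_feat, txt_feat, img_pre, txt_pre, predictor)
```

Routing both losses through the same pre-projectors is one plausible reading of the "parameter-compromise" integration the abstract mentions; the actual placement and objective in CLIPin may differ.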
Related papers
- CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision [0.08699280339422537]
We propose CLIP-Joint-Detect, a framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via an InfoNCE contrastive loss and an auxiliary cross-entropy term (a minimal sketch of this alignment appears after the list below). We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11).
arXiv Detail & Related papers (2025-12-28T15:21:20Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition [55.189113121465816]
We propose a novel correlation adaptation prompt network (CAPNET) for long-tailed multi-label visual recognition. CAPNET explicitly models correlations from CLIP's textual encoder. It improves generalization through test-time ensembling and realigns visual-textual modalities.
arXiv Detail & Related papers (2025-11-25T18:57:28Z) - PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning [38.70994263844236]
Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs.
arXiv Detail & Related papers (2025-09-26T06:13:40Z) - CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [17.05291662808873]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Secondly, CLIP-IN incorporates long captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z) - Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors [50.7383184560431]
Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. We propose a concise CL approach for CLIP based on incremental prompt tuning. We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting.
arXiv Detail & Related papers (2025-05-27T03:51:37Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - RankCLIP: Ranking-Consistent Language-Image Pretraining [7.92247304974314]
RankCLIP is a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP. By extending the traditional pair-wise loss to list-wise, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality.
arXiv Detail & Related papers (2024-04-15T00:12:27Z) - LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module is proposed that injects unmasked image embeddings into masked text embeddings.
arXiv Detail & Related papers (2023-12-01T15:54:55Z) - CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations. Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
arXiv Detail & Related papers (2023-11-28T03:00:59Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Semantically Contrastive Learning for Low-light Image Enhancement [48.71522073014808]
Low-light image enhancement (LLE) remains challenging due to the prevailing low contrast and weak visibility of single RGB images.
We propose an effective semantically contrastive learning paradigm for LLE (namely SCL-LLE).
Our method surpasses state-of-the-art LLE models on six independent cross-scene datasets.
arXiv Detail & Related papers (2021-12-13T07:08:33Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.