CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
- URL: http://arxiv.org/abs/2508.06434v2
- Date: Thu, 25 Sep 2025 12:29:48 GMT
- Title: CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
- Authors: Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li
- Abstract summary: Large-scale natural image-text datasets often suffer from loose semantic alignment due to weak supervision. We propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures. Two shared pre-projectors are designed for the image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning.
- Score: 28.2773807732662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
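The abstract describes the core mechanism: a standard CLIP contrastive objective combined with a non-contrastive branch, with shared pre-projectors sitting on the path used by both objectives. Below is a minimal PyTorch-style sketch of how such a plug-in could be wired. The names (PreProjector, clipin_loss, predictor), the BYOL/SimSiam-style cosine-prediction objective, and the loss weighting are illustrative assumptions, not the released implementation at https://github.com/T6Yang/CLIPin.

```python
# Hedged sketch: wiring a non-contrastive branch into a CLIP-style contrastive loss
# through shared pre-projectors. Names and the specific non-contrastive objective
# are assumptions for illustration; see https://github.com/T6Yang/CLIPin for the
# authors' actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreProjector(nn.Module):
    """Pre-projector placed after an encoder; its output feeds both the
    contrastive and the non-contrastive objectives, so the two share parameters."""

    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def clipin_loss(img_feat, txt_feat, img_pre, txt_pre, predictor,
                temperature: float = 0.07, alpha: float = 0.5) -> torch.Tensor:
    """CLIP's symmetric InfoNCE plus a non-contrastive cosine-prediction term."""
    # Shared pre-projection, then L2 normalization.
    z_i = F.normalize(img_pre(img_feat), dim=-1)
    z_t = F.normalize(txt_pre(txt_feat), dim=-1)

    # Contrastive branch: in-batch image-text InfoNCE (standard CLIP loss).
    logits = z_i @ z_t.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Non-contrastive branch: predict one modality from the other without negatives,
    # with a stop-gradient on the target side (BYOL/SimSiam-style).
    p_i = F.normalize(predictor(z_i), dim=-1)
    p_t = F.normalize(predictor(z_t), dim=-1)
    loss_non = -(p_i * z_t.detach()).sum(-1).mean() - (p_t * z_i.detach()).sum(-1).mean()

    return loss_con + alpha * loss_non


# Usage sketch with random features standing in for encoder outputs.
dim = 512
img_pre, txt_pre = PreProjector(dim), PreProjector(dim)
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
img_feat, txt_feat = torch.randn(8, dim), torch.randn(8, dim)
loss = clipin_loss(img_feat, txt_feat, img_pre, txt_pre, predictor)
```

Routing both losses through the same pre-projectors is one plausible reading of the "parameter-compromise" integration the abstract mentions; the actual placement and objective in CLIPin may differ.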
Related papers
- CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision [0.08699280339422537]
We propose CLIP-Joint-Detect, a framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via an InfoNCE contrastive loss and an auxiliary cross-entropy term (a minimal sketch of this alignment appears after the list below). We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11).
arXiv Detail & Related papers (2025-12-28T15:21:20Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition [55.189113121465816]
We propose a novel correlation adaptation prompt network (CAPNET) for long-tailed multi-label visual recognition. CAPNET explicitly models correlations from CLIP's textual encoder. It improves generalization through test-time ensembling and realigns visual-textual modalities.
arXiv Detail & Related papers (2025-11-25T18:57:28Z) - PANICL: Mitigating Over-Reliance on Single Prompt in Visual In-Context Learning [38.70994263844236]
Visual In-Context Learning (VICL) uses input-output image pairs, referred to as in-context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. VICL often suffers from over-reliance on a single in-context pair, which can lead to biased and unstable predictions. We introduce PAtch-based $k$-Nearest neighbor visual In-Context Learning (PANICL), a general training-free framework that mitigates this issue by leveraging multiple in-context pairs.
arXiv Detail & Related papers (2025-09-26T06:13:40Z) - CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [17.05291662808873]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Secondly, CLIP-IN incorporates long captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z) - Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors [50.7383184560431]
Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. We propose a concise CL approach for CLIP based on incremental prompt tuning. We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting.
arXiv Detail & Related papers (2025-05-27T03:51:37Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - RankCLIP: Ranking-Consistent Language-Image Pretraining [7.92247304974314]
RankCLIP is a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP. By extending the traditional pair-wise loss to list-wise, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality.
arXiv Detail & Related papers (2024-04-15T00:12:27Z) - LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module is proposed that injects unmasked image embeddings into masked text embeddings.
arXiv Detail & Related papers (2023-12-01T15:54:55Z) - CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations. Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
arXiv Detail & Related papers (2023-11-28T03:00:59Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Semantically Contrastive Learning for Low-light Image Enhancement [48.71522073014808]
Low-light image enhancement (LLE) remains challenging due to the prevailing low contrast and weak visibility of single RGB images.
We propose an effective semantically contrastive learning paradigm for LLE (namely SCL-LLE).
Our method surpasses state-of-the-art LLE models on six independent cross-scene datasets.
arXiv Detail & Related papers (2021-12-13T07:08:33Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.