UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
- URL: http://arxiv.org/abs/2209.13430v1
- Date: Tue, 27 Sep 2022 14:36:16 GMT
- Title: UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
- Authors: Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim,
Honglak Lee, Junmo Kim
- Abstract summary: We propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training.
UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space.
UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks.
- Score: 62.97551575508387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training vision-language models with contrastive objectives has shown
promising results that are both scalable to large uncurated datasets and
transferable to many downstream applications. Subsequent works have aimed to
improve data efficiency by adding self-supervision terms, but because the
inter-domain (image-text) contrastive loss and the intra-domain (image-image)
contrastive loss are defined on separate embedding spaces in those works, many
feasible combinations of supervision are overlooked. To overcome this issue, we
propose UniCLIP, a Unified framework for Contrastive Language-Image
Pre-training. UniCLIP integrates the contrastive loss of both inter-domain
pairs and intra-domain pairs into a single universal space. The discrepancies
that arise when integrating contrastive losses across different domains are
resolved by the three key components of UniCLIP: (1) augmentation-aware feature
embedding, (2) MP-NCE loss, and (3) a domain-dependent similarity measure.
UniCLIP outperforms previous vision-language pre-training methods on various
single- and multi-modality downstream tasks. In our experiments, we show that
each component of UniCLIP contributes meaningfully to the final performance.
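As an illustration only (not the authors' released code), the sketch below shows what integrating inter-domain (image-text) and intra-domain (image-image) contrastive terms into one shared embedding space can look like. The abstract does not specify the exact forms of the augmentation-aware feature embedding, the MP-NCE loss, or the domain-dependent similarity measure, so the sketch substitutes a generic multi-positive InfoNCE over concatenated image-view and text embeddings, with separate temperatures for image-image and image-text pairs standing in for the domain-dependent similarity; all function and variable names are hypothetical.

```python
# Minimal sketch, NOT the authors' implementation: a single-space contrastive
# loss treating both inter-domain (image-text) and intra-domain (image-image)
# pairs as positives. UniCLIP's actual augmentation-aware embedding, MP-NCE
# loss, and domain-dependent similarity are not detailed in the abstract; the
# per-domain temperatures below are only a crude stand-in for the last of these.
import torch
import torch.nn.functional as F

def unified_contrastive_loss(img_v1, img_v2, txt, t_inter=0.07, t_intra=0.1):
    """img_v1, img_v2: [N, D] embeddings of two augmented views of each image.
    txt: [N, D] caption embeddings. All are assumed to share one embedding space."""
    n, device = img_v1.size(0), img_v1.device
    # Normalize so dot products are cosine similarities in the shared space.
    z = F.normalize(torch.cat([img_v1, img_v2, txt], dim=0), dim=-1)  # [3N, D]

    # Separate temperatures for image-text vs. image-image pairs (assumption).
    is_txt = torch.cat([torch.zeros(2 * n), torch.ones(n)]).bool().to(device)
    cross_domain = is_txt[None, :] ^ is_txt[:, None]
    temps = torch.where(cross_domain,
                        torch.tensor(t_inter, device=device),
                        torch.tensor(t_intra, device=device))
    logits = (z @ z.t()) / temps

    # Never contrast an embedding with itself.
    self_mask = torch.eye(3 * n, dtype=torch.bool, device=device)
    logits = logits.masked_fill(self_mask, float('-inf'))

    # Positives: the two views of the same image, and each view with its caption.
    ids = torch.arange(n, device=device).repeat(3)  # instance id of every row
    pos = (ids[None, :] == ids[:, None]) & ~self_mask

    # Multi-positive InfoNCE: average cross-entropy over all positive pairs.
    log_prob = logits.log_softmax(dim=-1)
    return -log_prob[pos].mean()
```

In a training step this could stand in for the plain CLIP objective, e.g. loss = unified_contrastive_loss(f_img(aug1(x)), f_img(aug2(x)), f_txt(captions)), where f_img and f_txt are hypothetical encoders projecting into the same output dimension.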
Related papers
- Joint semi-supervised and contrastive learning enables zero-shot domain-adaptation and multi-domain segmentation [1.5393913074555419]
SegCLR is a versatile framework designed to segment volumetric images across different domains.
We demonstrate the superior performance of SegCLR through a comprehensive evaluation.
arXiv Detail & Related papers (2024-05-08T18:10:59Z)
- RankCLIP: Ranking-Consistent Language-Image Pretraining [7.92247304974314]
RANKCLIP is a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP.
By extending the traditional pair-wise loss to list-wise, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality.
arXiv Detail & Related papers (2024-04-15T00:12:27Z)
- Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation [25.499205902426716]
We introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation.
We craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components.
Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information.
arXiv Detail & Related papers (2024-03-11T17:33:12Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [93.45067274442881]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- One-for-All: Towards Universal Domain Translation with a Single StyleGAN [86.33216867136639]
We propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains.
The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations.
UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
arXiv Detail & Related papers (2023-10-22T08:02:55Z)
- Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters [59.82088750033897]
We set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental situations.
We find that common continual learning methods have poor generalization ability on unseen domains.
To this end, we propose a rehearsal-free framework based on Vision Transformer (ViT), named Contrastive Mixture of Adapters (CMoA).
arXiv Detail & Related papers (2023-02-12T15:18:14Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)