LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models
- URL: http://arxiv.org/abs/2312.00674v1
- Date: Fri, 1 Dec 2023 15:54:55 GMT
- Title: LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models
- Authors: Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe
Wang
- Abstract summary: We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module that injects unmasked image embeddings into masked text embeddings is also proposed.
- Score: 45.672539931681065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training like CLIP has shown promising
performance on various downstream tasks such as zero-shot image
classification and image-text retrieval. Most existing CLIP-like works
adopt relatively large image encoders such as ResNet50 or ViT, while
lightweight counterparts are rarely discussed. In this paper, we propose a
multi-level interaction paradigm for training lightweight CLIP models.
Firstly, to mitigate the problem that some image-text pairs are not in
strict one-to-one correspondence, we improve the conventional global
instance-level alignment objective by progressively softening the labels
of negative samples. Secondly, a token-level alignment objective based on
relaxed bipartite matching is introduced for finer-grained alignment
between image patches and textual words. Moreover, based on the
observation that the accuracy of a CLIP model does not increase
correspondingly as the text encoder grows, an additional masked language
modeling (MLM) objective is leveraged to maximize the potential of the
shortened text encoder. In practice, an auxiliary fusion module that
injects unmasked image embeddings into masked text embeddings at different
network stages is proposed to enhance the MLM. Extensive experiments show
that, without introducing additional computational cost during inference,
the proposed method achieves higher performance on multiple downstream
tasks. Minimal sketches of the three objectives are given below.
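A minimal sketch of the softened instance-level alignment, assuming PyTorch. The blending weight `alpha`, the temperature, and the use of the model's own detached similarity distribution as the soft target are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch: CLIP-style contrastive loss whose one-hot targets are blended
# with a soft distribution; `alpha` grows during training so that the
# labels of negative samples are softened progressively (an assumption).
import torch
import torch.nn.functional as F

def soft_label_clip_loss(img_emb, txt_emb, alpha, temperature=0.07):
    """img_emb, txt_emb: (B, D). `alpha` in [0, 1] ramps up over training."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits_per_image = img_emb @ txt_emb.t() / temperature  # (B, B)
    logits_per_text = logits_per_image.t()

    eye = torch.eye(logits_per_image.size(0), device=img_emb.device)
    # Progressive blend of hard one-hot targets with the model's own
    # (detached) similarity distribution over the batch.
    tgt_i2t = (1 - alpha) * eye + alpha * logits_per_image.detach().softmax(-1)
    tgt_t2i = (1 - alpha) * eye + alpha * logits_per_text.detach().softmax(-1)

    # Symmetric soft-target cross-entropy (image-to-text and text-to-image).
    return 0.5 * (F.cross_entropy(logits_per_image, tgt_i2t)
                  + F.cross_entropy(logits_per_text, tgt_t2i))
```

In such a scheme `alpha` would typically be increased on a schedule (e.g., linearly per epoch), so early training uses hard targets while later training tolerates near-duplicate negatives that are not strict mismatches.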
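A minimal sketch of token-level alignment via bipartite matching between image patches and words, assuming PyTorch and SciPy. The paper's specific relaxation of the matching is not reproduced here; this version simply matches min(P, W) pairs with a Hungarian solver and a 1 - cosine cost, both assumptions:

```python
# Sketch: bipartite matching between patch and word embeddings; matched
# pairs are pulled toward cosine similarity 1.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def token_alignment_loss(patch_emb, word_emb):
    """patch_emb: (P, D) image-patch embeddings; word_emb: (W, D) word
    embeddings. Only min(P, W) pairs are matched."""
    p = F.normalize(patch_emb, dim=-1)
    w = F.normalize(word_emb, dim=-1)
    sim = p @ w.t()                                  # (P, W) cosine similarities

    # The assignment itself is non-differentiable, so it is solved on a
    # detached copy; gradients flow only through the matched similarities.
    rows, cols = linear_sum_assignment((-sim).detach().cpu().numpy())
    rows = torch.as_tensor(rows, device=sim.device)
    cols = torch.as_tensor(cols, device=sim.device)
    return (1.0 - sim[rows, cols]).mean()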
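A minimal sketch of the auxiliary fusion idea for MLM: masked text embeddings cross-attend to unmasked image embeddings before the prediction head. The single fusion stage shown, the layer sizes, and the default vocabulary size (CLIP's 49,408 BPE tokens) are assumptions; the paper fuses at several network stages:

```python
# Sketch: cross-attention fusion injecting unmasked image embeddings into
# masked text embeddings ahead of MLM prediction. Sizes are assumptions.
import torch
import torch.nn as nn

class FusionMLMHead(nn.Module):
    def __init__(self, dim=512, num_heads=8, vocab_size=49408):
        super().__init__()
        # Masked text tokens act as queries; unmasked image tokens as
        # keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlm_head = nn.Linear(dim, vocab_size)

    def forward(self, masked_text_emb, image_emb):
        # masked_text_emb: (B, T, D); image_emb: (B, N, D).
        fused, _ = self.cross_attn(masked_text_emb, image_emb, image_emb)
        fused = self.norm(masked_text_emb + fused)   # residual fusion
        return self.mlm_head(fused)                  # (B, T, vocab) logits
```

The MLM cross-entropy would be computed over masked positions only, and because such a module is used only during pre-training and then discarded, inference-time cost is unchanged, consistent with the abstract's claim.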
Related papers
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- CLIP for Lightweight Semantic Segmentation [14.039603036741278]
We present a new feature fusion module which enables language-guided paradigm to be applied to lightweight networks.
The module is model-agnostic, which can not only make language-guided lightweight semantic segmentation practical, but also fully exploit the pretrained knowledge of language priors.
arXiv Detail & Related papers (2023-10-11T11:26:35Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with 10,800x faster inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)