Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
- URL: http://arxiv.org/abs/2312.12659v1
- Date: Tue, 19 Dec 2023 23:11:06 GMT
- Title: Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
- Authors: Bumsoo Kim, Jinhyung Kim, Yeonsik Jo, Seung Hwan Kim
- Abstract summary: ECLIPSE features a distinctive distillation architecture wherein a shared text encoder is utilized between an online image encoder and a momentum image encoder.
Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder.
- Score: 10.649402840032138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language pretraining (VLP) have been largely
attributed to large-scale data collected from the web. However, such uncurated
datasets contain weakly correlated image-text pairs, causing data inefficiency.
To address this issue, knowledge distillation has been explored, at the expense
of extra image and text momentum encoders, to generate teaching signals for
misaligned image-text pairs. In this paper, our goal is to resolve the
misalignment problem with an efficient distillation framework. To this end, we
propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with
Self-distilled Encoders. ECLIPSE features a distinctive distillation
architecture wherein a shared text encoder is utilized between an online image
encoder and a momentum image encoder. This strategic design choice enables the
distillation to operate within a unified projected space of text embedding,
resulting in better performance. Based on the unified text embedding space,
ECLIPSE compensates for the additional computational cost of the momentum image
encoder by expediting the online image encoder. Through our extensive
experiments, we validate that there is a sweet spot between expedition and
distillation where the partial view from the expedited online image encoder
interacts complementarily with the momentum teacher. As a result, ECLIPSE
outperforms its counterparts while achieving substantial acceleration in
inference speed.
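To make the mechanism concrete, here is a minimal PyTorch-style sketch of one training step as we read it from the abstract. Everything here is our own illustration, not the authors' released code: the encoder interfaces, the EMA schedule, and the half-resolution stand-in for the expedited "partial view" are all assumptions.

```python
# Minimal sketch of one ECLIPSE-style training step. The encoders, the EMA
# schedule, and the low-resolution stand-in for the "partial view" are our
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn.functional as F

def ema_update(momentum_enc, online_enc, m=0.999):
    # The momentum teacher tracks the online student via an exponential
    # moving average of its weights.
    for p_t, p_s in zip(momentum_enc.parameters(), online_enc.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)

def eclipse_step(online_img, momentum_img, txt_enc, images, texts, tau=0.07):
    # Expedite the student: feed a reduced "partial view" of each image
    # (stood in here by a half-resolution version).
    partial = F.interpolate(images, scale_factor=0.5, mode="bilinear")
    z_s = F.normalize(online_img(partial), dim=-1)        # student embeddings
    with torch.no_grad():
        z_t = F.normalize(momentum_img(images), dim=-1)   # teacher sees full view
    z_txt = F.normalize(txt_enc(texts), dim=-1)           # SHARED text encoder

    # Both image towers score against the same text embeddings, so student
    # and teacher live in one unified text-projected space.
    logits_s = z_s @ z_txt.t() / tau
    with torch.no_grad():
        logits_t = z_t @ z_txt.t() / tau

    labels = torch.arange(len(images), device=images.device)
    loss_clip = F.cross_entropy(logits_s, labels)          # contrastive term
    loss_distill = F.kl_div(                               # teacher softens
        F.log_softmax(logits_s, dim=-1),                   # weakly aligned pairs
        F.softmax(logits_t, dim=-1),
        reduction="batchmean",
    )
    return loss_clip + loss_distill
```

The shared text tower is the design point emphasized above: because `z_txt` is common to both branches, the distillation term compares student and teacher in one projected space rather than across two drifting ones.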
Related papers
- A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning [0.15346678870160887]
We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
arXiv Detail & Related papers (2024-09-27T06:12:31Z)
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
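As a rough illustration of that idea, the sketch below pre-trains a character decoder against CLIP text embeddings used as a one-token pseudo visual memory. The decoder shape, the embedding table, the head, and the loop are our assumptions, not DPTR's actual recipe.

```python
# Hedged sketch: CLIP text embeddings stand in for visual features so a
# recognition decoder can be pre-trained from text alone. The decoder,
# embedding table, and head are illustrative assumptions.
import torch
import torch.nn as nn
import clip  # OpenAI's open-source CLIP package

model, _ = clip.load("ViT-B/32")
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

def pretrain_step(texts, char_targets, char_embed, head):
    # char_targets: (B, L) character ids; real training would shift inputs
    # for teacher forcing, which we omit for brevity.
    with torch.no_grad():
        pseudo_visual = model.encode_text(clip.tokenize(texts)).float()  # (B, 512)
    memory = pseudo_visual.unsqueeze(1)          # treat as a 1-token "image"
    out = decoder(char_embed(char_targets), memory)
    logits = head(out)                           # (B, L, vocab)
    return nn.functional.cross_entropy(logits.flatten(0, 1),
                                       char_targets.flatten())
```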
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval [34.065449743428005]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable searches.
We introduce the Reducing Task Discrepancy of text encoders for Composed Image Retrieval (RTD), a plug-and-play training scheme for the text encoder.
We also propose two additional techniques to improve the proposed learning scheme: a hard negatives-based refined batch sampling strategy and a sophisticated concatenation scheme.
arXiv Detail & Related papers (2024-06-13T14:49:28Z)
- Perceptual Image Compression with Cooperative Cross-Modal Side Information [53.356714177243745]
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff.
Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
arXiv Detail & Related papers (2023-11-23T08:31:11Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
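The summary names the mechanism without unpacking it; below is a rough sketch of what masked self-distillation typically looks like, where the zero-token masking scheme and the MSE target are our simplifying assumptions rather than the paper's exact recipe.

```python
# Rough sketch of masked self-distillation layered on a CLIP-style objective.
# Token masking by zeroing and the MSE target are our simplifying assumptions.
import torch
import torch.nn.functional as F

def masked_self_distill(student, teacher, patches, mask_ratio=0.75):
    # patches: (B, N, D) patch tokens; teacher is an EMA copy of the student.
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    masked = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    z_s = student(masked)                  # (B, N, D) student predictions
    with torch.no_grad():
        z_t = teacher(patches)             # targets from the full view
    # Supervise only positions the student could not see; this loss is added
    # to the usual image-text contrastive term during pretraining.
    return F.mse_loss(z_s[mask], z_t[mask])
```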
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval [117.15862403330121]
We propose LoopITR, which combines dual encoders and cross encoders in the same network for joint learning.
Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder.
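A compact sketch of that loop under our own naming follows: the dual encoder's similarity matrix mines hard negatives for the cross encoder, and the cross encoder's detached scores are distilled back. Dense cross-encoder scoring here is a simplification; in practice only positives and mined negatives would be rescored.

```python
# Illustrative sketch of a LoopITR-style step. All module and function names
# are our assumptions; cross_enc is scored densely only for clarity.
import torch
import torch.nn.functional as F

def loopitr_step(dual_enc, cross_enc, images, texts, k=4):
    z_img, z_txt = dual_enc(images, texts)        # (B, D) each, normalized
    sim_dual = z_img @ z_txt.t()                  # cheap all-pairs similarity
    sim_cross = cross_enc(images, texts)          # (B, B) joint scores

    B = sim_dual.size(0)
    # 1) Dual encoder mines hard negatives: its top-scoring wrong pairs.
    mined = (sim_dual - 1e4 * torch.eye(B, device=sim_dual.device)) \
        .topk(k, dim=1).indices                   # (B, k)

    # 2) Cross encoder trains on positives vs. those hard negatives.
    hard_logits = torch.cat([sim_cross.diag().unsqueeze(1),
                             sim_cross.gather(1, mined)], dim=1)  # (B, 1+k)
    loss_cross = F.cross_entropy(
        hard_logits, torch.zeros(B, dtype=torch.long, device=sim_dual.device))

    # 3) Cross encoder distills its sharper scores back into the dual encoder.
    loss_distill = F.kl_div(F.log_softmax(sim_dual, dim=1),
                            F.softmax(sim_cross.detach(), dim=1),
                            reduction="batchmean")
    return loss_cross + loss_distill
```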
arXiv Detail & Related papers (2022-03-10T16:41:12Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- EncoderMI: Membership Inference against Pre-trained Encoders in Contrastive Learning [27.54202989524394]
We propose EncoderMI, the first membership inference method against image encoders pre-trained by contrastive learning.
We evaluate EncoderMI on image encoders we pre-trained on multiple datasets ourselves, as well as on the Contrastive Language-Image Pre-training (CLIP) image encoder, which was pre-trained on 400 million (image, text) pairs collected from the Internet and released by OpenAI.
arXiv Detail & Related papers (2021-08-25T03:00:45Z)