Long-CLIP: Unlocking the Long-Text Capability of CLIP
- URL: http://arxiv.org/abs/2403.15378v3
- Date: Mon, 22 Jul 2024 06:10:41 GMT
- Title: Long-CLIP: Unlocking the Long-Text Capability of CLIP
- Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang
- Abstract summary: Long-CLIP is a plug-and-play alternative to Contrastive Language-Image Pre-training.
Long-CLIP supports long-text input while retaining, or even surpassing, CLIP's zero-shot generalizability.
It offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
- Score: 47.13547303843929
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-to-image generation by aligning the image and text modalities. Despite its widespread adoption, a significant limitation of CLIP is the inadequate length of its text input. The text is restricted to 77 tokens, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions and limits its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses CLIP's zero-shot generalizability, and aligns with the CLIP latent space, so it can readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts would necessitate pretraining on vast amounts of data, incurring significant expense. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
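The knowledge-preserved stretching strategy lends itself to a short illustration. The sketch below is not the authors' implementation; it only shows one plausible way to keep the first few well-trained positions of CLIP's 77-entry positional embedding untouched and linearly interpolate the remainder to a longer length. The function name, the split point of 20, and the stretch ratio of 4 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def stretch_positional_embedding(pos_embed: torch.Tensor,
                                 keep: int = 20,
                                 stretch_ratio: float = 4.0) -> torch.Tensor:
    """Illustrative sketch of knowledge-preserved stretching (assumed details).

    pos_embed:     (77, dim) CLIP text positional embedding table.
    keep:          number of leading positions copied unchanged
                   (the empirically "effective" ones).
    stretch_ratio: interpolation factor applied to the remaining positions.
    """
    kept = pos_embed[:keep]            # leave the well-trained positions untouched
    rest = pos_embed[keep:]            # (77 - keep, dim) tail to be stretched

    # Linearly interpolate the tail along the sequence dimension.
    new_len = int(rest.shape[0] * stretch_ratio)
    stretched = F.interpolate(
        rest.T.unsqueeze(0),           # (1, dim, 77 - keep)
        size=new_len,
        mode="linear",
        align_corners=True,
    ).squeeze(0).T                     # (new_len, dim)

    return torch.cat([kept, stretched], dim=0)  # (keep + new_len, dim)


# Example: stretch a 77-token table to 20 + 57 * 4 = 248 positions.
orig = torch.randn(77, 512)
long_pos = stretch_positional_embedding(orig)
print(long_pos.shape)  # torch.Size([248, 512])
```

Keeping the leading positions fixed is what makes the stretching "knowledge-preserving": the positions that carry most of CLIP's learned ordering information are not disturbed, while the rarely used tail is spread out to cover the longer context.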
Related papers
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs [0.351124620232225]
FineLIP enhances cross-modal text-image mapping by incorporating fine-grained alignment with longer text inputs.
FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens.
We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
arXiv Detail & Related papers (2025-04-02T17:19:59Z) - TULIP: Token-length Upgraded CLIP [57.818513403100326]
We address the challenge of representing long captions in vision-language models, such as CLIP.
By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens.
We propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models.
arXiv Detail & Related papers (2024-10-13T22:34:15Z) - Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z) - Linear Alignment of Vision-language Models for Image Captioning [8.921774238325566]
We propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods.
We also propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment.
arXiv Detail & Related papers (2023-07-10T17:59:21Z) - Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z) - LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework, that learns importance-aware lexicon representations.
Our framework achieves a 5.5x-221.3x faster retrieval speed and 13.2x-48.8x less index storage memory.
arXiv Detail & Related papers (2023-02-06T16:24:41Z) - CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, CLIP2GAN, which leverages the CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)