TULIP: Token-length Upgraded CLIP
- URL: http://arxiv.org/abs/2410.10034v2
- Date: Fri, 28 Mar 2025 16:47:25 GMT
- Title: TULIP: Token-length Upgraded CLIP
- Authors: Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
- Abstract summary: We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens. We propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models.
- Score: 57.818513403100326
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We address the challenge of representing long captions in vision-language models, such as CLIP. By design these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens and hindering performance on tasks requiring longer descriptions. Although recent work has attempted to overcome this limit, their proposed approaches struggle to model token relationships over longer distances and simply extend to a fixed new token length. Instead, we propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models. We do so by improving the architecture with relative position encodings, followed by a training procedure that (i) distills the original CLIP text encoder into an encoder with relative position encodings and (ii) enhances the model for aligning longer captions with images. By effectively encoding captions longer than the default 77 tokens, our model outperforms baselines on cross-modal tasks such as retrieval and text-to-image generation. The code repository is available at https://github.com/ivonajdenkoska/tulip.
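The abstract's key architectural change is replacing CLIP's fixed, absolute positional encodings with relative ones, so the text encoder is no longer tied to a 77-token table. One widely used relative scheme is rotary position embeddings (RoPE), sketched below in NumPy; the paper does not state which relative encoding variant it uses, so this is an illustrative assumption, not TULIP's exact implementation.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to token features x of shape
    (seq_len, dim), with dim even. Each feature pair (j, j + dim/2) is
    rotated by a position-dependent angle, so attention scores between
    rotated queries and keys depend only on relative token distance.
    Unlike a learned absolute table (fixed at, e.g., 77 entries), the
    rotation is computed from the position index, so any sequence
    length is supported without retraining a positional table."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)            # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied pairwise; preserves the norm of each token vector.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Works for sequences far beyond CLIP's default 77-token limit:
features = rotary_embed(np.random.default_rng(0).normal(size=(248, 64)))
```

Because the encoding is a pure rotation, token-vector norms are unchanged, which is one reason such encodings distill cleanly from an absolutely-encoded teacher.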
Related papers
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs [0.351124620232225]
FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text inputs.
FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens.
We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
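FineLIP's first step, extending the positional embeddings to handle longer text, is commonly done by interpolating the learned positional table to more positions. The sketch below shows linear interpolation of a 77-entry table to an arbitrary length; whether FineLIP uses exactly this interpolation scheme is an assumption made here for illustration.

```python
import numpy as np

def stretch_positional_table(pos_emb, new_len):
    """Linearly interpolate a learned positional-embedding table of
    shape (old_len, dim), e.g. CLIP's (77, dim), to new_len positions.
    Each embedding dimension is resampled independently over a
    normalized [0, 1] position axis; the first and last rows of the
    original table are preserved exactly."""
    old_len, dim = pos_emb.shape
    old_pos = np.linspace(0.0, 1.0, old_len)
    new_pos = np.linspace(0.0, 1.0, new_len)
    return np.stack(
        [np.interp(new_pos, old_pos, pos_emb[:, d]) for d in range(dim)],
        axis=-1,
    )

# Stretch a 77-position table to 248 positions for longer captions:
extended = stretch_positional_table(np.random.default_rng(0).normal(size=(77, 512)), 248)
```

Interpolated tables are usually fine-tuned afterwards on long-caption data, since the intermediate positions start as mere blends of their neighbors.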
arXiv Detail & Related papers (2025-04-02T17:19:59Z)
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
We relabel the data with long captions; however, directly learning from them may degrade the model's understanding of short text.
We then help the model catch up to its original level of short text understanding yet greatly enhance its capability of long text understanding.
Our method demonstrates superior performance in long-text-image retrieval tasks.
arXiv Detail & Related papers (2024-10-07T17:52:56Z)
- ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z)
- Long-CLIP: Unlocking the Long-Text Capability of CLIP [47.13547303843929]
Long-CLIP is a plug-and-play alternative to Contrastive Language-Image Pre-training.
Long-CLIP supports long-text input while retaining, or even surpassing, CLIP's zero-shot generalizability.
It offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
arXiv Detail & Related papers (2024-03-22T17:58:16Z)
- A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions [9.87625120950535]
We collect the Densely Captioned Images dataset, containing 7805 natural images human-annotated with mask-aligned descriptions.
With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' understanding of image content.
We show that modern techniques that make progress on standard benchmarks do not yield significant improvement on our sDCI-based benchmark.
arXiv Detail & Related papers (2023-12-14T00:42:23Z)
- Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to compute multimodal similarity and use it as a reward function.
We also propose a simple fine-tuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose a simple length-level embedding to endow image captioning models with length controllability.
Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows.
We further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity.
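A length-level embedding can be sketched as a small lookup table indexed by a coarse bucket of the desired caption length, added to the decoder's input embeddings. The bucket boundaries and table below are illustrative assumptions; in the actual model the table is learned end-to-end with the captioner.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LEVELS, DIM = 4, 16
# One embedding vector per length level (learned jointly in practice).
level_table = rng.normal(size=(NUM_LEVELS, DIM))

def length_level(n_tokens, bounds=(9, 14, 19)):
    """Bucket a desired caption length into a coarse level id (0..3).
    The boundary values here are hypothetical."""
    return sum(n_tokens > b for b in bounds)

def condition_on_length(word_emb, n_tokens):
    """Add the level embedding to every input token embedding, so the
    decoder is conditioned on the requested caption length."""
    return word_emb + level_table[length_level(n_tokens)]
```

At inference time, the same decoder can then be steered toward short or long captions simply by changing the requested level.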
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.