FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
- URL: http://arxiv.org/abs/2504.01916v1
- Date: Wed, 02 Apr 2025 17:19:59 GMT
- Title: FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
- Authors: Mothilal Asokan, Kebin Wu, Fatima Albreiki
- Abstract summary: FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
- Score: 0.351124620232225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.
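The abstract names two concrete mechanisms: extending the 77-token positional embeddings and aggregating local image and text tokens for token-to-token alignment. Below is a minimal PyTorch sketch of those two ideas, not the authors' released code: the function names are hypothetical, the 248-token target length is an arbitrary example, and a simple FILIP-style max-over-patches plus mean-over-words aggregation stands in for FineLIP's dynamic aggregation.

```python
import torch
import torch.nn.functional as F

def extend_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch CLIP's learned positional embeddings from (77, d) to (new_len, d)
    by 1-D linear interpolation along the sequence axis."""
    pe = pos_emb.t().unsqueeze(0)                    # (1, d, 77)
    pe = F.interpolate(pe, size=new_len, mode="linear", align_corners=True)
    return pe.squeeze(0).t()                         # (new_len, d)

def token_alignment_scores(img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """Token-to-token similarity between every image and every caption in a batch.
    img_tokens: (B, Ni, d) patch features; txt_tokens: (B, Nt, d) word features.
    Returns a (B, B) score matrix usable in a CLIP-style contrastive loss."""
    img_tokens = F.normalize(img_tokens, dim=-1)
    txt_tokens = F.normalize(txt_tokens, dim=-1)
    # Pairwise token similarities for every image-caption pair: (B, B, Ni, Nt).
    sim = torch.einsum("bnd,cmd->bcnm", img_tokens, txt_tokens)
    # Each text token picks its best-matching image patch, then scores are averaged
    # over text tokens (a simple late-interaction stand-in for dynamic aggregation).
    return sim.max(dim=2).values.mean(dim=-1)

# Usage with dummy tensors: 77 -> 248 positions, 4 images vs. 4 long captions.
longer_pos = extend_positional_embeddings(torch.randn(77, 512), 248)   # (248, 512)
scores = token_alignment_scores(torch.randn(4, 196, 512),              # 14x14 patches
                                torch.randn(4, 248, 512))               # (4, 4) scores
```

In a fine-tuning setup the interpolated positional embeddings would be trained further on long-caption data, and the resulting (B, B) score matrix can be plugged into the usual symmetric InfoNCE objective.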
Related papers
- GOAL: Global-local Object Alignment Learning [7.9061560322289335]
Vision-language models like CLIP have shown impressive capabilities in aligning images and text.
They often struggle with lengthy, detailed text descriptions because their training focuses on short, concise captions.
We present GOAL, a novel fine-tuning method that enhances CLIP's ability to handle lengthy text.
arXiv Detail & Related papers (2025-03-22T14:27:32Z)
- DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment [20.953645420787527]
We train a CLIP-like model with only a fraction of the computational cost compared to CLIP.
We achieve state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-12-20T20:46:48Z)
- DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining.
It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images.
DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z)
- TULIP: Token-length Upgraded CLIP [57.818513403100326]
We address the challenge of representing long captions in vision-language models such as CLIP.
By design, these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens.
We propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models.
arXiv Detail & Related papers (2024-10-13T22:34:15Z)
- MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment [53.235290505274676]
Large-scale vision-language models such as CLIP can improve semantic segmentation performance.
We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment.
MTA-CLIP achieves state-of-the-art results, surpassing prior works by an average of 2.8% and 1.3% on benchmark datasets.
arXiv Detail & Related papers (2024-07-31T14:56:42Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a large amount of high-quality and diverse text to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Long-CLIP: Unlocking the Long-Text Capability of CLIP [47.13547303843929]
Long-CLIP is a plug-and-play alternative to Contrastive Language-Image Pre-training.
Long-CLIP supports long-text input while retaining or even surpassing CLIP's zero-shot generalizability.
It offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
arXiv Detail & Related papers (2024-03-22T17:58:16Z)
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module that injects unmasked image embeddings into masked text embeddings is also proposed.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- Self-Supervised Image Captioning with CLIP [0.0]
We introduce a self-supervised image captioning method.
After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data.
Despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset.
arXiv Detail & Related papers (2023-06-26T23:29:16Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP.
We scale CLIP4STR in terms of model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)