Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
- URL: http://arxiv.org/abs/2402.15120v1
- Date: Fri, 23 Feb 2024 06:11:50 GMT
- Title: Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
- Authors: Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran,
Franck Dernoncourt, Jaewoo Kang
- Abstract summary: We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
- Score: 83.3736789315201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive language-image pre-training (CLIP) models have demonstrated
considerable success across various vision-language tasks, such as
text-to-image retrieval, where the model is required to effectively process
natural language input to produce an accurate visual output. However, current
models still face limitations in dealing with linguistic variations in input
queries, such as paraphrases, making it challenging to handle a broad range of
user queries in real-world applications. In this study, we introduce a
straightforward fine-tuning approach to enhance the representations of CLIP
models for paraphrases. Our approach involves a two-step paraphrase generation
process, where we automatically create two categories of paraphrases from
web-scale image captions by leveraging large language models. Subsequently, we
fine-tune the CLIP text encoder using these generated paraphrases while
freezing the image encoder. Our resulting model, which we call ParaCLIP,
exhibits significant improvements over baseline CLIP models across various
tasks, including paraphrased retrieval (with rank similarity scores improved by
up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven
semantic textual similarity tasks.
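The "rank similarity" mentioned in the abstract compares the retrieval results returned for a query and for its paraphrase. The abstract does not say which metric is used, so the following is only a minimal sketch under one common assumption: Jaccard overlap between the two top-k result lists. The function names (`cosine`, `top_k`, `jaccard_at_k`) and the toy embeddings are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def top_k(query: Sequence[float], images: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k images most similar to the query (ties broken by index)."""
    scored = [(cosine(query, img), idx) for idx, img in enumerate(images)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


def jaccard_at_k(query_a: Sequence[float], query_b: Sequence[float],
                 images: Sequence[Sequence[float]], k: int) -> float:
    """Jaccard overlap of the top-k retrieval sets of two queries (assumed metric)."""
    a = set(top_k(query_a, images, k))
    b = set(top_k(query_b, images, k))
    return len(a & b) / len(a | b)


# Toy 2-D embeddings standing in for image-encoder outputs (illustrative only).
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]        # text embedding of the original query
paraphrase = [0.9, 0.3]   # text embedding of its paraphrase
unrelated = [-1.0, 0.2]   # text embedding of an unrelated query

same = jaccard_at_k(query, paraphrase, images, k=2)
different = jaccard_at_k(query, unrelated, images, k=2)
```

A text encoder that handles paraphrases well should push the score for `query` and `paraphrase` toward 1, while unrelated queries should stay near 0.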
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets with few tunable parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user context fusion process by memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module injecting unmasked image embedding into masked text embedding is proposed.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [55.575224613422726]
Contrastive Language-Image Pre-training (CLIP) has shown great promise in pixel-level open-vocabulary learning tasks.
Existing models easily misidentify input pixels from unseen classes, thus confusing novel classes with semantically-similar ones.
We disentangle the ill-posed optimization problem into two parallel processes: one performs semantic matching individually, and the other judges reliability for improving discrimination ability.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.