Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment
- URL: http://arxiv.org/abs/2409.01936v1
- Date: Tue, 3 Sep 2024 14:33:01 GMT
- Title: Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment
- Authors: Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung
- Abstract summary: Contrastive Language and Image Pairing (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
- Score: 0.7499722271664144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP's performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems.
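As an illustration of the first method described in the abstract (sequential fine-tuning), the sketch below shows the second stage only: the image encoder, assumed to be already retrieval-optimized in stage 1, is frozen, and the text encoder is realigned to its embeddings with a standard CLIP-style symmetric contrastive loss. This is a minimal PyTorch sketch under assumptions; `image_encoder`, `text_encoder`, and `loader` are placeholders, and the loss and hyperparameters are generic rather than the authors' actual training recipe.

```python
# Minimal sketch of the second stage of the sequential fine-tuning method
# described in the abstract: the image encoder (already retrieval-optimized
# in stage 1) is frozen, and the text encoder is realigned to its embeddings
# with a standard CLIP-style symmetric InfoNCE loss. Encoder and data-loader
# objects are placeholders, not the authors' code.
import torch
import torch.nn.functional as F


def realign_text_encoder(image_encoder, text_encoder, loader,
                         epochs=1, lr=1e-5, temperature=0.07, device="cuda"):
    image_encoder.eval()
    for p in image_encoder.parameters():      # keep stage-1 embeddings fixed
        p.requires_grad_(False)
    text_encoder.train()

    optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for images, token_ids in loader:      # paired (image, caption) batches
            with torch.no_grad():
                img = F.normalize(image_encoder(images.to(device)), dim=-1)
            txt = F.normalize(text_encoder(token_ids.to(device)), dim=-1)

            logits = img @ txt.t() / temperature
            targets = torch.arange(img.size(0), device=device)
            # symmetric loss: image-to-text and text-to-image directions
            loss = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return text_encoder
```

The second method from the abstract (pseudo-captions during retrieval optimization) could reuse the same loss with generated captions standing in for ground-truth text; that variant is not shown here.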
Related papers
- Prompt Recovery for Image Generation Models: A Comparative Study of Discrete Optimizers [58.50071292008407]
We present the first head-to-head comparison of recent discrete optimization techniques for the problem of prompt inversion.
We find that the CLIP similarity between the inverted prompts and the ground truth image is a poor proxy for the similarity between the ground truth image and the image generated by those prompts.
arXiv Detail & Related papers (2024-08-12T21:35:59Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
- Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening [53.1711708318581]
Current image-text retrieval methods suffer from $N$-related time complexity.
This paper presents a simple and effective keyword-guided pre-screening framework for image-text retrieval; a generic sketch of the pre-screening idea appears after this list.
arXiv Detail & Related papers (2023-03-14T09:36:42Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
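To make the keyword-guided pre-screening entry above more concrete, the sketch below illustrates the general idea: restrict the gallery to images whose keyword tags overlap the query keywords, then rank only that candidate set by embedding similarity. The inverted-index layout and function names are illustrative assumptions, not the framework proposed in the cited paper.

```python
# Generic illustration of keyword-guided pre-screening: filter the gallery
# with an inverted keyword index before running the expensive embedding
# comparison. Index layout and names are illustrative assumptions.
from collections import defaultdict


def build_inverted_index(image_keywords):
    """image_keywords: dict mapping image_id -> iterable of keyword strings."""
    index = defaultdict(set)
    for image_id, keywords in image_keywords.items():
        for kw in keywords:
            index[kw].add(image_id)
    return index


def prescreen_then_rank(query_keywords, query_embedding, index,
                        image_embeddings, top_k=10):
    """image_embeddings: dict mapping image_id -> L2-normalized vector."""
    # Cheap pre-screening step: union of images sharing any query keyword.
    candidates = set()
    for kw in query_keywords:
        candidates |= index.get(kw, set())
    # Expensive similarity ranking runs only on the reduced candidate set.
    ranked = sorted(candidates,
                    key=lambda i: float(image_embeddings[i] @ query_embedding),
                    reverse=True)
    return ranked[:top_k]
```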