CLIPSE -- a minimalistic CLIP-based image search engine for research
- URL: http://arxiv.org/abs/2504.17643v1
- Date: Thu, 24 Apr 2025 15:13:37 GMT
- Title: CLIPSE -- a minimalistic CLIP-based image search engine for research
- Authors: Steve Göring
- Abstract summary: In general, CLIPSE uses CLIP embeddings to process the images and also the text queries. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A brief overview of CLIPSE, a self-hosted image search engine with the main application of research, is provided. In general, CLIPSE uses CLIP embeddings to process the images and also the text queries. The overall framework is designed with simplicity to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
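As a rough illustration of the pipeline the abstract describes (CLIP embeddings for both images and text queries, an index built offline, nearest-neighbour lookup at query time), the following is a minimal sketch. It assumes the Hugging Face transformers CLIP implementation and plain cosine similarity; CLIPSE itself may use different models, libraries, or index structures.

```python
# Minimal sketch of a CLIP-embedding image search loop (not the CLIPSE code itself).
# Assumes: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def index_images(paths):
    """Offline indexing: one normalized CLIP embedding per image."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def query(text, index, paths, top_k=5):
    """Online querying: embed the text and rank images by cosine similarity."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    q = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    scores = (index @ q.T).squeeze(-1)
    best = scores.topk(min(top_k, len(paths)))
    return [(paths[i], float(scores[i])) for i in best.indices]

# Example usage (hypothetical file names):
# index = index_images(["cat.jpg", "dog.jpg", "car.jpg"])
# print(query("a photo of a cat", index, ["cat.jpg", "dog.jpg", "car.jpg"]))
```

For larger collections, as the abstract suggests, such an in-memory index would typically be sharded across several instances or replaced by an approximate nearest-neighbour index.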
Related papers
- CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation [3.1667055223489786]
Contrastive Language-Image Pre-training models excel in zero-shot classification, yet face challenges in complex multi-object scenarios.
This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO.
Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects.
arXiv Detail & Related papers (2025-02-27T07:34:42Z)
- ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval [83.01358520910533]
We introduce a new framework that can boost the performance of large-scale pre-trained vision-language models.
The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple mapping network, to predict a set of visual prompts (a minimal sketch follows this entry).
ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks.
arXiv Detail & Related papers (2025-02-21T18:59:57Z)
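The ELIP summary only states that a mapping network turns the text query into visual prompts; the exact architecture is not given here. As a hedged illustration, the sketch below uses a small MLP that maps a frozen text embedding to a handful of prompt vectors that could be prepended to the image encoder's patch tokens. Names, dimensions, and the prepending strategy are assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a text-to-visual-prompt mapping network (not the ELIP code).
import torch
import torch.nn as nn

class TextToVisualPrompts(nn.Module):
    """Map one text-query embedding to `num_prompts` visual prompt tokens."""
    def __init__(self, text_dim=512, vis_dim=768, num_prompts=4, hidden=1024):
        super().__init__()
        self.num_prompts = num_prompts
        self.vis_dim = vis_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_prompts * vis_dim),
        )

    def forward(self, text_emb):                 # text_emb: (B, text_dim)
        prompts = self.mlp(text_emb)             # (B, num_prompts * vis_dim)
        return prompts.view(-1, self.num_prompts, self.vis_dim)

# The prompts would then be concatenated in front of the image patch tokens, e.g.:
# patch_tokens: (B, N, vis_dim) -> torch.cat([prompts, patch_tokens], dim=1)
```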
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model as a learning-to-rank task (a rough sketch follows this entry).
Our approach incorporates learnable prompts to adapt to new instructions for ranking purposes.
Our ranking-aware adapter consistently outperforms fine-tuned CLIPs on various tasks.
arXiv Detail & Related papers (2024-12-09T18:51:05Z)
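The ranking-adapter summary gives only the high-level idea: learnable prompts plus a learning-to-rank objective on top of CLIP. The sketch below is one possible reading, with a small adapter holding learnable prompt vectors over frozen CLIP text features, trained with a pairwise margin ranking loss on image-text similarities. All names and the loss choice are assumptions for illustration.

```python
# Hypothetical ranking adapter on top of frozen CLIP features (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankingPromptAdapter(nn.Module):
    def __init__(self, dim=512, num_prompts=4):
        super().__init__()
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, dim))  # learnable prompts
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_emb):                 # (B, dim) frozen CLIP text feature
        adapted = text_emb + self.prompts.mean(dim=0)
        return F.normalize(self.proj(adapted), dim=-1)

def pairwise_ranking_loss(instr_emb, img_better, img_worse, margin=0.1):
    """The image ranked higher under the instruction should get a larger similarity."""
    s_hi = (instr_emb * F.normalize(img_better, dim=-1)).sum(-1)
    s_lo = (instr_emb * F.normalize(img_worse, dim=-1)).sum(-1)
    return F.relu(margin - (s_hi - s_lo)).mean()
```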
- CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval [2.381261552604303]
We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture.
Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase.
Our experiments show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy.
arXiv Detail & Related papers (2024-06-19T08:15:10Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER), based on the distance structure between images and their neighbor texts (a minimal sketch follows this entry).
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
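The CODER summary describes only a representation built from the distance structure between an image and a set of neighbor texts. As a hedged illustration, the sketch below represents each image by its vector of cosine similarities to a bank of text embeddings; the bank contents and any weighting are assumptions, not the paper's construction.

```python
# Hypothetical cross-modal neighbor representation (not the CODER code):
# represent each image by its similarities to a bank of text embeddings.
import torch
import torch.nn.functional as F

def neighbor_representation(image_feats, text_bank_feats):
    """image_feats: (N, D) CLIP image features; text_bank_feats: (M, D) CLIP text features.
    Returns an (N, M) matrix whose rows are the new image representations."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_bank_feats, dim=-1)
    return img @ txt.T          # cosine similarity to every text in the bank

# Classification could then compare these rows against class "anchor" rows built the same way.
```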
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features, integrating the bimodal information (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
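The Combiner is described here only as a network that merges CLIP image and text features into a single query representation. The sketch below is one plausible form, a gated mixture of the two features plus a learned residual; the actual architecture in the paper may differ.

```python
# Hypothetical Combiner for composed image retrieval on CLIP features (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.residual = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, img_feat, txt_feat):        # both (B, dim) CLIP features
        pair = torch.cat([img_feat, txt_feat], dim=-1)
        g = torch.sigmoid(self.gate(pair))        # how much to trust image vs. text
        fused = g * img_feat + (1 - g) * txt_feat + self.residual(pair)
        return F.normalize(fused, dim=-1)         # query vector to match against gallery images
```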
- GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning [55.77244064907146]
The one-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning.
Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
arXiv Detail & Related papers (2023-03-16T12:06:02Z)
- Injecting Image Details into CLIP's Feature Space [29.450159407113155]
We introduce an efficient framework that can produce a single feature representation for a high-resolution image.
In the framework, we train a feature fusing model based on CLIP features extracted from a carefully designed image patching scheme (a minimal sketch follows this entry).
We validate our framework by retrieving images from class-prompted queries on real-world and synthetic datasets.
arXiv Detail & Related papers (2022-08-31T06:18:10Z)
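For a high-resolution image, the summary describes extracting CLIP features from image patches and fusing them into a single representation, but the patch layout and fusion model are not specified here. The sketch below tiles the image into a grid of crops, embeds each crop with CLIP, and averages the embeddings, which is the simplest placeholder for the learned fusing model the paper trains.

```python
# Illustrative patch-then-fuse pipeline for high-resolution images (not the paper's model).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def patch_features(image, grid=3):
    """Split `image` into a grid x grid set of crops and embed each crop with CLIP."""
    w, h = image.size
    crops = [image.crop((c * w // grid, r * h // grid, (c + 1) * w // grid, (r + 1) * h // grid))
             for r in range(grid) for c in range(grid)]
    inputs = processor(images=crops, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # (grid*grid, D)
    return F.normalize(feats, dim=-1)

@torch.no_grad()
def fused_feature(image, grid=3):
    """Placeholder fusion: mean of patch features; the paper trains a fusing model instead."""
    return F.normalize(patch_features(image, grid).mean(dim=0, keepdim=True), dim=-1)

# usage (hypothetical path): fused_feature(Image.open("large_photo.jpg").convert("RGB"))
```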
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP (a rough sketch of the idea follows this entry).
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
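The baseline is summarized only as building zero-shot segmentation on top of CLIP. As a hedged illustration of that general idea, the sketch below classifies externally supplied region proposals by comparing the CLIP embedding of each cropped region with text embeddings of class prompts; the proposal source and prompt wording are assumptions, not the paper's exact pipeline.

```python
# Illustrative region-level zero-shot labeling with CLIP (not the paper's exact method).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def label_regions(image, boxes, class_names):
    """boxes: list of (left, top, right, bottom) region proposals from any external source."""
    prompts = [f"a photo of a {c}" for c in class_names]
    crops = [image.crop(b) for b in boxes]
    inputs = processor(text=prompts, images=crops, return_tensors="pt", padding=True)
    out = model(**inputs)
    # logits_per_image: (num_regions, num_classes) similarity scores
    probs = out.logits_per_image.softmax(dim=-1)
    return [class_names[i] for i in probs.argmax(dim=-1).tolist()]

# usage (hypothetical boxes): label_regions(img, [(0, 0, 100, 100)], ["cat", "dog", "sky"])
```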
- Compact Deep Aggregation for Set Retrieval [87.52470995031997]
We focus on retrieving images containing multiple faces from a large scale dataset of images.
Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities.
We show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that.
arXiv Detail & Related papers (2020-03-26T08:43:15Z)
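The set-retrieval summary describes representing each image by the set of face descriptors it contains and retrieving images that contain all queried identities; the compact aggregation itself is not detailed here. The toy sketch below scores an image by the weakest best-match between each query identity and the image's face descriptors, which captures the "contains all identities" requirement without the paper's compact descriptor.

```python
# Toy set-retrieval scoring over per-image face descriptors (not the paper's compact descriptor).
import numpy as np

def image_score(query_ids, face_descriptors):
    """query_ids: (Q, D), one descriptor per queried identity;
    face_descriptors: (F, D), descriptors of the faces detected in one image.
    The score is the worst best-match, so every identity must be present to score well."""
    q = query_ids / np.linalg.norm(query_ids, axis=1, keepdims=True)
    f = face_descriptors / np.linalg.norm(face_descriptors, axis=1, keepdims=True)
    sims = q @ f.T                      # (Q, F) cosine similarities
    return sims.max(axis=1).min()       # each identity's best face, then the weakest of those

def rank_images(query_ids, dataset):
    """dataset: list of (image_id, face_descriptor_array); returns image ids best-first."""
    scored = [(img_id, image_score(query_ids, faces)) for img_id, faces in dataset]
    return [img_id for img_id, _ in sorted(scored, key=lambda x: -x[1])]
```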