Composed Image Retrieval for Training-Free Domain Conversion
- URL: http://arxiv.org/abs/2412.03297v1
- Date: Wed, 04 Dec 2024 13:16:17 GMT
- Title: Composed Image Retrieval for Training-Free Domain Conversion
- Authors: Nikos Efthymiadis, Bill Psomas, Zakaria Laskar, Konstantinos Karantzalos, Yannis Avrithis, Ondřej Chum, Giorgos Tolias
- Abstract summary: We show that a strong vision-language model provides sufficient descriptive power without additional training.
The query image is mapped to the text input space using textual inversion.
Our method outperforms prior art by a large margin on standard and newly introduced benchmarks.
- Score: 18.347643780858284
- License:
- Abstract: This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike the common practice of inverting in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and is made more robust using retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: https://github.com/NikosEfth/freedom
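The two core steps of the abstract, discrete textual inversion via nearest-neighbor search and the weighted ensemble of composed text queries, can be sketched as follows. This is a minimal illustration, not the authors' implementation: random unit vectors stand in for CLIP image/text embeddings, the vocabulary size, `k`, and the softmax weighting are all assumptions, and `domain_emb` is a hypothetical stand-in for embeddings of prompts like "a sketch of <word>".

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 100  # toy embedding dimension and vocabulary size

# Stand-ins for text embeddings of vocabulary words and the query image embedding.
vocab_emb = rng.normal(size=(vocab_size, d))
vocab_emb /= np.linalg.norm(vocab_emb, axis=1, keepdims=True)
image_emb = rng.normal(size=d)
image_emb /= np.linalg.norm(image_emb)

# Discrete textual inversion: softly map the image to its k nearest vocabulary words.
k = 5
sims = vocab_emb @ image_emb                  # cosine similarity (unit vectors)
topk = np.argsort(-sims)[:k]                  # indices of the k closest words
w = np.exp(sims[topk])
weights = w / w.sum()                         # softmax weights over mapped words

# Weighted ensemble query: each mapped word is combined with the target-domain
# text; domain_emb stands in for the embeddings of those composed prompts.
domain_emb = rng.normal(size=(k, d))          # hypothetical composed-text embeddings
query = (weights[:, None] * domain_emb).sum(axis=0)
query /= np.linalg.norm(query)                # final query used to rank the database
```

Database images would then be ranked by cosine similarity between `query` and their image embeddings.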
Related papers
- Composed Image Retrieval for Remote Sensing [24.107610091033997]
This work introduces composed image retrieval to remote sensing.
It allows querying a large image archive using image examples accompanied by a textual description.
A novel method fusing image-to-image and text-to-image similarity is introduced.
arXiv Detail & Related papers (2024-05-24T14:18:31Z) - Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets.
Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space.
We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs)
KEDs implicitly models the attributes of the reference images by incorporating a database.
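The projection step these ZS-CIR works describe, turning a reference image feature into a pseudo-word token in the text embedding space, can be sketched roughly as below. This is a hedged toy illustration: the linear map `W` stands in for a learned mapping network, the dimensions are arbitrary, and random vectors replace real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_txt = 12, 8  # toy image-feature and text-token dimensions

# Hypothetical learned projection (a small mapping network in practice)
# taking an image feature into the text embedding space.
W = rng.normal(size=(d_txt, d_img)) * 0.1
image_feat = rng.normal(size=d_img)
pseudo_token = W @ image_feat  # continuous pseudo-word token

# The pseudo token acts like a word embedding: it would be spliced into a
# prompt such as "a photo of <pseudo> ..." before running the text encoder.
prompt_tokens = rng.normal(size=(4, d_txt))          # stand-ins for real word tokens
composed = np.vstack([prompt_tokens, pseudo_token])  # prompt with pseudo word appended
```

The composed token sequence would then be encoded as an ordinary text query for retrieval.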
arXiv Detail & Related papers (2024-03-24T04:23:56Z) - UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity [50.91030850662369]
Existing text-based person retrieval datasets often have relatively coarse-grained text annotations.
This hinders models from comprehending the fine-grained semantics of query texts in real scenarios.
We contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity.
arXiv Detail & Related papers (2023-12-06T11:50:14Z) - Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN performs better than quite a few SOTA image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z) - Embedding Arithmetic for Text-driven Image Transformation [48.7704684871689]
Text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man.
Recent works aiming to bridge this semantic gap embed images and text into a multimodal space.
We introduce the SIMAT dataset to evaluate the task of text-driven image transformation.
arXiv Detail & Related papers (2021-12-06T16:51:50Z) - Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search all text instances from an image gallery, which are the same or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
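The ranking step described above, scoring each detected text instance against the query by a learned cross-modal similarity, reduces to a sort once embeddings are available. A minimal sketch, with random unit vectors standing in for the learned query and instance embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # toy embedding dimension

# Stand-ins for the embedded query text and detected text instances.
query_emb = rng.normal(size=d)
query_emb /= np.linalg.norm(query_emb)
instance_embs = rng.normal(size=(6, d))
instance_embs /= np.linalg.norm(instance_embs, axis=1, keepdims=True)

# Rank detected instances by cosine similarity to the query, best first.
scores = instance_embs @ query_emb
ranking = np.argsort(-scores)
```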
arXiv Detail & Related papers (2021-04-04T07:18:38Z) - Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the desired content is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z) - Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of image manifold to represent the feature space of images, learned via neural networks, as a graph.
We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images.
The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval.
arXiv Detail & Related papers (2020-11-19T16:09:14Z) - Retrieval Guided Unsupervised Multi-domain Image-to-Image Translation [59.73535607392732]
Image to image translation aims to learn a mapping that transforms an image from one visual domain to another.
We propose the use of an image retrieval system to assist the image-to-image translation task.
arXiv Detail & Related papers (2020-08-11T20:11:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.