ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and
Implicit Similarity
- URL: http://arxiv.org/abs/2203.08101v1
- Date: Tue, 15 Mar 2022 17:29:20 GMT
- Title: ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and
Implicit Similarity
- Authors: Ginger Delmas and Rafael Sampaio de Rezende and Gabriela Csurka and
Diane Larlus
- Abstract summary: Current approaches combine the features of each of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
- Score: 16.550790981646276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An intuitive way to search for images is to use queries composed of an
example image and a complementary text. While the former provides rich and
implicit context for the search, the latter explicitly calls for new traits, or
specifies how some elements of the example image should be changed to retrieve
the desired target image. Current approaches typically combine the features of
each of the two elements of the query into a single representation, which can
then be compared to the ones of the potential target images. Our work aims at
shedding new light on the task by looking at it through the prism of two
familiar and related frameworks: text-to-image and image-to-image retrieval.
Taking inspiration from them, we exploit the specific relation of each query
element with the targeted image and derive lightweight attention mechanisms
that mediate between the two complementary modalities. We validate
our approach on several retrieval benchmarks, querying with images and their
associated free-form text modifiers. Our method obtains state-of-the-art
results without resorting to the side information, multi-level features, heavy
pre-training, or large architectures used in previous works.
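The abstract frames the final score as two text-to-image and image-to-image views of the query: an explicit-matching term (does the candidate show the traits the text asks for?) and an implicit-similarity term (is the candidate close to the reference image in the aspects the text leaves untouched?), each mediated by lightweight attention. Below is a minimal PyTorch sketch of that two-score idea, assuming precomputed global embeddings of a shared dimension; the class name, the sigmoid-gate attention, and all dimensions are illustrative assumptions, not the authors' actual ARTEMIS implementation.

```python
# Hedged sketch of a two-score compositional retrieval objective in the
# spirit of the abstract. Everything here (names, gate design, dim=512)
# is an assumption for illustration, not the paper's architecture.
import torch
import torch.nn.functional as F

D = 512  # assumed common embedding dimension

class TwoScoreRetriever(torch.nn.Module):
    """Scores a (reference image, modifier text) query against candidates.

    Two lightweight text-conditioned attention gates, echoing the paper's
    framing:
      - explicit matching  (text-to-image): compare the text to the
        candidate, reweighted toward dimensions the text cares about;
      - implicit similarity (image-to-image): compare reference and
        candidate on dimensions the text does NOT ask to change.
    """
    def __init__(self, dim=D):
        super().__init__()
        # Sigmoid gates over embedding dimensions stand in for the
        # paper's attention mechanisms (assumption).
        self.gate_em = torch.nn.Linear(dim, dim)
        self.gate_is = torch.nn.Linear(dim, dim)

    def forward(self, ref_img, text, cand_img):
        a_em = torch.sigmoid(self.gate_em(text))  # where the text matters
        a_is = torch.sigmoid(self.gate_is(text))  # where the reference matters
        # Explicit matching: text vs. attention-reweighted candidate.
        s_em = F.cosine_similarity(text, a_em * cand_img, dim=-1)
        # Implicit similarity: reference vs. candidate, both reweighted.
        s_is = F.cosine_similarity(a_is * ref_img, a_is * cand_img, dim=-1)
        return s_em + s_is  # rank candidates by this combined score

# Usage: rank 1000 candidates for a batch of 8 queries.
scorer = TwoScoreRetriever()
ref, txt = torch.randn(8, D), torch.randn(8, D)
cands = torch.randn(1000, D)
scores = scorer(ref.unsqueeze(1), txt.unsqueeze(1), cands.unsqueeze(0))  # (8, 1000)
ranking = scores.argsort(dim=1, descending=True)
```

The additive combination keeps the two views independent: the explicit term behaves like text-to-image retrieval and the implicit term like image-to-image retrieval, which is the decomposition the abstract draws its inspiration from.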
Related papers
- MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions [64.89284104414865]
We introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions.
MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations.
MagicLens achieves results comparable with or better than the prior best methods on eight benchmarks of various image retrieval tasks.
arXiv Detail & Related papers (2024-03-28T17:59:20Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR²S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z) - Two-stream Hierarchical Similarity Reasoning for Image-text Matching [66.43071159630006]
Previous approaches only consider learning single-stream similarity alignment.
A two-stream architecture is developed to decompose image-text matching into image-to-text level and text-to-image level similarity computation.
A hierarchical similarity reasoning module is proposed to automatically extract context information.
arXiv Detail & Related papers (2022-03-10T12:56:10Z) - Towards Efficient Cross-Modal Visual Textual Retrieval using
Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence feature extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z) - Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z) - SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval [15.074592583852167]
We focus on the task of text-conditioned image retrieval, which utilizes supporting text feedback alongside a reference image to retrieve images.
We propose a novel framework, SAC, which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change".
We show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-the-art techniques.
arXiv Detail & Related papers (2020-09-03T06:55:23Z) - Tasks Integrated Networks: Joint Detection and Retrieval for Image
Search [99.49021025124405]
In many real-world searching scenarios (e.g., video surveillance), the objects are seldom accurately detected or annotated.
We first introduce an end-to-end Integrated Net (I-Net), which has three merits.
We further propose an improved I-Net, called DC-I-Net, which makes two new contributions.
arXiv Detail & Related papers (2020-09-03T03:57:50Z)