A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
- URL: http://arxiv.org/abs/2208.03354v1
- Date: Fri, 5 Aug 2022 18:43:37 GMT
- Title: A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
- Authors: Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays
- Abstract summary: We present an end-to-end trainable model for image retrieval using a text description and a sketch as input.
We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval.
- Score: 63.12810494378133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of retrieving images with both a sketch and a text
query. We present TASK-former (Text And SKetch transformer), an end-to-end
trainable model for image retrieval using a text description and a sketch as
input. We argue that both input modalities complement each other in a manner
that cannot be achieved easily by either one alone. TASK-former follows the
late-fusion dual-encoder approach, similar to CLIP, which allows efficient and
scalable retrieval since the retrieval set can be indexed independently of the
queries. We empirically demonstrate that using an input sketch (even a poorly
drawn one) in addition to text considerably increases retrieval recall compared
to traditional text-based image retrieval. To evaluate our approach, we collect
5,000 hand-drawn sketches for images in the test set of the COCO dataset. The
collected sketches are available at https://janesjanes.github.io/tsbir/.
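To make the late-fusion dual-encoder idea concrete, a minimal sketch in PyTorch follows. The encoder stand-ins, embedding size, and the additive fusion rule are illustrative assumptions; this is not the actual TASK-former implementation.

```python
import torch
import torch.nn.functional as F

d = 512  # shared embedding dimension (assumed)

# Placeholder encoders: any networks mapping each modality to a
# d-dimensional vector would play these roles in a dual-encoder setup.
image_encoder = torch.nn.Linear(2048, d)   # over precomputed image features
text_encoder = torch.nn.Linear(768, d)     # over precomputed text features
sketch_encoder = torch.nn.Linear(2048, d)  # over precomputed sketch features

def embed_gallery(image_feats):
    """Index the retrieval set once, independently of any future query."""
    return F.normalize(image_encoder(image_feats), dim=-1)

def embed_query(text_feats, sketch_feats):
    """Late fusion: encode each query modality separately, then combine."""
    t = F.normalize(text_encoder(text_feats), dim=-1)
    s = F.normalize(sketch_encoder(sketch_feats), dim=-1)
    return F.normalize(t + s, dim=-1)  # simple additive fusion (assumption)

# Retrieval is a cosine-similarity lookup against the precomputed index.
gallery = embed_gallery(torch.randn(10_000, 2048))  # built offline
query = embed_query(torch.randn(1, 768), torch.randn(1, 2048))
top5 = (query @ gallery.T).topk(5).indices  # indices of best-matching images
```

Because the gallery embeddings never depend on the query, they can be computed once and stored in any nearest-neighbor index, which is what makes this design scale to large retrieval sets.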
Related papers
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
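A rough picture of combining frozen, pre-trained CLIP sketch and text embeddings with a small learned composition head; the names and sizes are assumptions, not the paper's actual framework:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionHead(nn.Module):
    """Hypothetical fusion head trained on top of frozen CLIP embeddings."""
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, sketch_emb, text_emb):
        fused = self.mlp(torch.cat([sketch_emb, text_emb], dim=-1))
        return F.normalize(fused, dim=-1)

# Only the head would be trained (e.g. contrastively against photo
# embeddings); the pre-trained CLIP encoders stay frozen.
head = CompositionHead()
queries = head(torch.randn(4, 512), torch.randn(4, 512))  # composed queries
```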
- Towards Interactive Image Inpainting via Sketch Refinement [13.34066589008464]
We propose a two-stage image inpainting method termed SketchRefiner.
In the first stage, we propose using a cross-correlation loss function to robustly calibrate and refine the user-provided sketches.
In the second stage, we learn to extract informative features from the abstracted sketches in the feature space and modulate the inpainting process.
arXiv Detail & Related papers (2023-06-01T07:15:54Z)
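The first-stage cross-correlation objective can be pictured as one minus the normalized cross-correlation between the refined sketch and its target; this formulation is an illustrative guess, not SketchRefiner's exact loss:

```python
import torch

def ncc_loss(pred, target, eps=1e-6):
    """1 - normalized cross-correlation between two sketch maps.

    pred, target: (batch, 1, H, W) tensors. Perfectly correlated maps
    give a loss of 0. An illustrative stand-in, not the paper's loss.
    """
    p = pred - pred.mean(dim=(2, 3), keepdim=True)
    t = target - target.mean(dim=(2, 3), keepdim=True)
    num = (p * t).sum(dim=(2, 3))
    den = torch.sqrt((p * p).sum(dim=(2, 3)) * (t * t).sum(dim=(2, 3)) + eps)
    return (1 - num / den).mean()

loss = ncc_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```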
- CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate better visual representations in image re-identification tasks.
The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID.
The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks.
arXiv Detail & Related papers (2022-11-25T09:41:57Z)
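The learnable-text-token idea can be pictured as a table of prompt embeddings, one set per identity, optimized while the CLIP encoders stay frozen. A minimal sketch with assumed names and sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_ids, n_tokens, d = 100, 4, 512  # assumed sizes

# One set of learnable prompt-token embeddings per identity.
id_tokens = nn.Parameter(torch.randn(num_ids, n_tokens, d) * 0.02)

# Stand-in for a frozen text encoder; here it simply pools the tokens.
text_encoder = lambda tokens: tokens.mean(dim=1)

def id_text_embedding(ids):
    """Map each ID's learnable tokens to a normalized text embedding."""
    return F.normalize(text_encoder(id_tokens[ids]), dim=-1)

# Conceptually: stage one optimizes only id_tokens with a contrastive
# loss tying each ID's text embedding to its image embeddings; stage
# two then fine-tunes the image encoder against these fixed prompts.
emb = id_text_embedding(torch.tensor([0, 7]))  # (2, d)
```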
- I Know What You Draw: Learning Grasp Detection Conditioned on a Few Freehand Sketches [74.63313641583602]
We propose a method to generate potential grasp configurations relevant to the sketch-depicted objects.
Our model is trained and tested end-to-end, making it easy to deploy in real-world applications.
arXiv Detail & Related papers (2022-05-09T04:23:36Z)
- FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context [112.07988211268612]
We advance sketch research to scenes with the first dataset of freehand scene sketches, FS-COCO.
Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals.
We study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions.
arXiv Detail & Related papers (2022-03-04T03:00:51Z)
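The per-point space-time information amounts to each stroke point carrying coordinates, a timestamp, and stroke membership. A minimal illustrative representation; the field names are assumptions, not FS-COCO's actual file format:

```python
from dataclasses import dataclass

@dataclass
class SketchPoint:
    x: float        # normalized horizontal position in [0, 1]
    y: float        # normalized vertical position in [0, 1]
    t: float        # seconds elapsed since drawing began
    stroke_id: int  # which pen-down stroke the point belongs to

# A vector sketch is an ordered sequence of such points, preserving
# both the geometry and the temporal order in which it was drawn.
sketch = [
    SketchPoint(0.10, 0.20, 0.00, 0),
    SketchPoint(0.15, 0.25, 0.03, 0),
    SketchPoint(0.40, 0.10, 0.90, 1),  # pen was lifted; a new stroke began
]
```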
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the desired content is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
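One simple way to turn a mouse trace into a usable "where" signal is to rasterize it onto a coarse spatial grid that can be fused with the text features; an illustrative sketch, not the paper's actual encoding:

```python
import torch

def trace_to_heatmap(points, grid=16):
    """Rasterize normalized (x, y) trace points into a (grid, grid) map."""
    heat = torch.zeros(grid, grid)
    for x, y in points:  # x, y assumed to lie in [0, 1]
        i = min(int(y * grid), grid - 1)  # row index from vertical position
        j = min(int(x * grid), grid - 1)  # column index from horizontal
        heat[i, j] += 1.0
    return heat / heat.sum().clamp(min=1.0)  # normalize to a distribution

# The heatmap supplies the "where"; a text embedding supplies the "what".
hm = trace_to_heatmap([(0.20, 0.30), (0.25, 0.35), (0.80, 0.70)])
```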
- Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval [203.2520862597357]
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user's query sketch.
We reformulate the conventional FG-SBIR framework with an on-the-fly design that starts retrieving as soon as the user starts drawing.
arXiv Detail & Related papers (2020-02-24T15:36:02Z)
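The on-the-fly design amounts to re-embedding the partial sketch and re-ranking the gallery as strokes arrive, so useful results appear before the drawing is finished. A schematic loop under assumed encoder and gallery shapes:

```python
import torch
import torch.nn.functional as F

def on_the_fly_retrieval(stroke_stream, sketch_encoder, gallery):
    """Re-rank a precomputed gallery after every new stroke.

    stroke_stream yields progressively more complete renderings of the
    user's sketch; gallery is an (N, d) matrix of normalized photo
    embeddings. All names and shapes are illustrative assumptions.
    """
    for partial_sketch in stroke_stream:
        q = F.normalize(sketch_encoder(partial_sketch), dim=0)  # (d,)
        scores = gallery @ q                                    # (N,)
        yield scores.topk(10).indices  # current top-10; the user can
                                       # stop drawing once satisfied
```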
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.