A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
- URL: http://arxiv.org/abs/2208.03354v1
- Date: Fri, 5 Aug 2022 18:43:37 GMT
- Title: A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
- Authors: Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays
- Abstract summary: We present an end-to-end trainable model for image retrieval using a text description and a sketch as input.
We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval.
- Score: 63.12810494378133
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the problem of retrieving images with both a sketch and a text
query. We present TASK-former (Text And SKetch transformer), an end-to-end
trainable model for image retrieval using a text description and a sketch as
input. We argue that both input modalities complement each other in a manner
that cannot be achieved easily by either one alone. TASK-former follows the
late-fusion dual-encoder approach, similar to CLIP, which allows efficient and
scalable retrieval since the retrieval set can be indexed independently of the
queries. We empirically demonstrate that using an input sketch (even a poorly
drawn one) in addition to text considerably increases retrieval recall compared
to traditional text-based image retrieval. To evaluate our approach, we collect
5,000 hand-drawn sketches for images in the test set of the COCO dataset. The
collected sketches are available at https://janesjanes.github.io/tsbir/.
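To make the late-fusion dual-encoder idea concrete, a minimal sketch in PyTorch follows. The encoder stand-ins, embedding size, and the additive fusion rule are illustrative assumptions; this is not the actual TASK-former implementation.

```python
import torch
import torch.nn.functional as F

d = 512  # shared embedding dimension (assumed)

# Placeholder encoders: any networks mapping each modality to a
# d-dimensional vector would play these roles in a dual-encoder setup.
image_encoder = torch.nn.Linear(2048, d)   # over precomputed image features
text_encoder = torch.nn.Linear(768, d)     # over precomputed text features
sketch_encoder = torch.nn.Linear(2048, d)  # over precomputed sketch features

def embed_gallery(image_feats):
    """Index the retrieval set once, independently of any future query."""
    return F.normalize(image_encoder(image_feats), dim=-1)

def embed_query(text_feats, sketch_feats):
    """Late fusion: encode each query modality separately, then combine."""
    t = F.normalize(text_encoder(text_feats), dim=-1)
    s = F.normalize(sketch_encoder(sketch_feats), dim=-1)
    return F.normalize(t + s, dim=-1)  # simple additive fusion (assumption)

# Retrieval is a cosine-similarity lookup against the precomputed index.
gallery = embed_gallery(torch.randn(10_000, 2048))  # built offline
query = embed_query(torch.randn(1, 768), torch.randn(1, 2048))
top5 = (query @ gallery.T).topk(5).indices  # indices of best-matching images
```

Because the gallery embeddings never depend on the query, they can be computed once and stored in any nearest-neighbor index, which is what makes this design scale to large retrieval sets.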
Related papers
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
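A rough picture of combining frozen, pre-trained CLIP sketch and text embeddings with a small learned composition head; the names and sizes are assumptions, not the paper's actual framework:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionHead(nn.Module):
    """Hypothetical fusion head trained on top of frozen CLIP embeddings."""
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, sketch_emb, text_emb):
        fused = self.mlp(torch.cat([sketch_emb, text_emb], dim=-1))
        return F.normalize(fused, dim=-1)

# Only the head would be trained (e.g. contrastively against photo
# embeddings); the pre-trained CLIP encoders stay frozen.
head = CompositionHead()
queries = head(torch.randn(4, 512), torch.randn(4, 512))  # composed queries
```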
- Towards Interactive Image Inpainting via Sketch Refinement [13.34066589008464]
We propose a two-stage image inpainting method termed SketchRefiner.
In the first stage, we propose using a cross-correlation loss function to robustly calibrate and refine the user-provided sketches.
In the second stage, we learn to extract informative features from the abstracted sketches in the feature space and modulate the inpainting process.
arXiv Detail & Related papers (2023-06-01T07:15:54Z)
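The first-stage cross-correlation objective can be pictured as one minus the normalized cross-correlation between the refined sketch and its target; this formulation is an illustrative guess, not SketchRefiner's exact loss:

```python
import torch

def ncc_loss(pred, target, eps=1e-6):
    """1 - normalized cross-correlation between two sketch maps.

    pred, target: (batch, 1, H, W) tensors. Perfectly correlated maps
    give a loss of 0. An illustrative stand-in, not the paper's loss.
    """
    p = pred - pred.mean(dim=(2, 3), keepdim=True)
    t = target - target.mean(dim=(2, 3), keepdim=True)
    num = (p * t).sum(dim=(2, 3))
    den = torch.sqrt((p * p).sum(dim=(2, 3)) * (t * t).sum(dim=(2, 3)) + eps)
    return (1 - num / den).mean()

loss = ncc_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```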
- CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate better visual representations in image re-identification tasks.
The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID.
The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks.
arXiv Detail & Related papers (2022-11-25T09:41:57Z)
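The learnable-text-token idea can be pictured as a table of prompt embeddings, one set per identity, optimized while the CLIP encoders stay frozen. A minimal sketch with assumed names and sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_ids, n_tokens, d = 100, 4, 512  # assumed sizes

# One set of learnable prompt-token embeddings per identity.
id_tokens = nn.Parameter(torch.randn(num_ids, n_tokens, d) * 0.02)

# Stand-in for a frozen text encoder; here it simply pools the tokens.
text_encoder = lambda tokens: tokens.mean(dim=1)

def id_text_embedding(ids):
    """Map each ID's learnable tokens to a normalized text embedding."""
    return F.normalize(text_encoder(id_tokens[ids]), dim=-1)

# Conceptually: stage one optimizes only id_tokens with a contrastive
# loss tying each ID's text embedding to its image embeddings; stage
# two then fine-tunes the image encoder against these fixed prompts.
emb = id_text_embedding(torch.tensor([0, 7]))  # (2, d)
```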
- I Know What You Draw: Learning Grasp Detection Conditioned on a Few Freehand Sketches [74.63313641583602]
We propose a method to generate potential grasp configurations relevant to the sketch-depicted objects.
Our model is trained and tested end-to-end, making it easy to deploy in real-world applications.
arXiv Detail & Related papers (2022-05-09T04:23:36Z)
- FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context [112.07988211268612]
We advance sketch research to scenes with the first dataset of freehand scene sketches, FS-COCO.
Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals.
We study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions.
arXiv Detail & Related papers (2022-03-04T03:00:51Z)
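The per-point space-time information amounts to each stroke point carrying coordinates, a timestamp, and stroke membership. A minimal illustrative representation; the field names are assumptions, not FS-COCO's actual file format:

```python
from dataclasses import dataclass

@dataclass
class SketchPoint:
    x: float        # normalized horizontal position in [0, 1]
    y: float        # normalized vertical position in [0, 1]
    t: float        # seconds elapsed since drawing began
    stroke_id: int  # which pen-down stroke the point belongs to

# A vector sketch is an ordered sequence of such points, preserving
# both the geometry and the temporal order in which it was drawn.
sketch = [
    SketchPoint(0.10, 0.20, 0.00, 0),
    SketchPoint(0.15, 0.25, 0.03, 0),
    SketchPoint(0.40, 0.10, 0.90, 1),  # pen was lifted; a new stroke began
]
```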
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the desired content is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
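One simple way to turn a mouse trace into a usable "where" signal is to rasterize it onto a coarse spatial grid that can be fused with the text features; an illustrative sketch, not the paper's actual encoding:

```python
import torch

def trace_to_heatmap(points, grid=16):
    """Rasterize normalized (x, y) trace points into a (grid, grid) map."""
    heat = torch.zeros(grid, grid)
    for x, y in points:  # x, y assumed to lie in [0, 1]
        i = min(int(y * grid), grid - 1)  # row index from vertical position
        j = min(int(x * grid), grid - 1)  # column index from horizontal
        heat[i, j] += 1.0
    return heat / heat.sum().clamp(min=1.0)  # normalize to a distribution

# The heatmap supplies the "where"; a text embedding supplies the "what".
hm = trace_to_heatmap([(0.20, 0.30), (0.25, 0.35), (0.80, 0.70)])
```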
- Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval [203.2520862597357]
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user's query sketch.
We reformulate the conventional FG-SBIR framework with an on-the-fly design that starts retrieving as soon as the user starts drawing.
arXiv Detail & Related papers (2020-02-24T15:36:02Z)
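The on-the-fly design amounts to re-embedding the partial sketch and re-ranking the gallery as strokes arrive, so useful results appear before the drawing is finished. A schematic loop under assumed encoder and gallery shapes:

```python
import torch
import torch.nn.functional as F

def on_the_fly_retrieval(stroke_stream, sketch_encoder, gallery):
    """Re-rank a precomputed gallery after every new stroke.

    stroke_stream yields progressively more complete renderings of the
    user's sketch; gallery is an (N, d) matrix of normalized photo
    embeddings. All names and shapes are illustrative assumptions.
    """
    for partial_sketch in stroke_stream:
        q = F.normalize(sketch_encoder(partial_sketch), dim=0)  # (d,)
        scores = gallery @ q                                    # (N,)
        yield scores.topk(10).indices  # current top-10; the user can
                                       # stop drawing once satisfied
```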
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.