Query-guided Attention in Vision Transformers for Localizing Objects
Using a Single Sketch
- URL: http://arxiv.org/abs/2303.08784v1
- Date: Wed, 15 Mar 2023 17:26:17 GMT
- Title: Query-guided Attention in Vision Transformers for Localizing Objects
Using a Single Sketch
- Authors: Aditay Tripathi, Anand Mishra, Anirban Chakraborty
- Abstract summary: Given a crude hand-drawn sketch of an object, the goal is to localize all instances of the same object on the target image.
This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap existing between the sketches and the natural images.
We propose a sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features.
- Score: 17.63475613154152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we investigate the problem of sketch-based object localization
on natural images, where given a crude hand-drawn sketch of an object, the goal
is to localize all the instances of the same object on the target image. This
problem proves difficult due to the abstract nature of hand-drawn sketches,
variations in the style and quality of sketches, and the large domain gap
existing between the sketches and the natural images. To mitigate these
challenges, existing works proposed attention-based frameworks to incorporate
query information into the image features. However, in these works, the query
features are incorporated after the image features have already been
independently learned, leading to inadequate alignment. In contrast, we propose
a sketch-guided vision transformer encoder that uses cross-attention after each
block of the transformer-based image encoder to learn query-conditioned image
features leading to stronger alignment with the query sketch. Further, at the
output of the decoder, the object and the sketch features are refined to bring
the representation of relevant objects closer to the sketch query and thereby
improve the localization. The proposed model also generalizes to the object
categories not seen during training, as the target image features learned by
our method are query-aware. Our localization framework can also utilize
multiple sketch queries via a novel trainable sketch fusion strategy. The model
is evaluated on the images from the public object detection benchmark, namely
MS-COCO, using the sketch queries from QuickDraw! and Sketchy datasets.
Compared with existing localization methods, the proposed approach gives a
6.6% and 8.0% improvement in mAP for seen objects using sketch queries from
QuickDraw! and Sketchy datasets, respectively, and a 12.2% improvement in
AP@50 for large objects that are 'unseen' during training.
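
To make the encoder design concrete, the following is a minimal PyTorch sketch of the core idea: cross-attention to the sketch query inserted after each transformer encoder block, plus a hypothetical attention-based fusion of multiple sketch queries. It is an illustration under assumed shapes and module choices, not the authors' implementation.

```python
# Minimal PyTorch sketch (not the authors' released code): cross-attention to
# the sketch query is applied after every transformer encoder block, so the
# image tokens are query-conditioned throughout the encoder. A hypothetical
# attention-based fusion of multiple sketch queries is included as well; all
# names, dimensions, and design details below are illustrative assumptions.
import torch
import torch.nn as nn


class SketchGuidedEncoderBlock(nn.Module):
    """Self-attention over image tokens, then cross-attention from image
    tokens (queries) to sketch tokens (keys/values), with a residual."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, sketch_tokens):
        x = self.self_block(img_tokens)                  # (B, N_img, dim)
        attended, _ = self.cross_attn(x, sketch_tokens, sketch_tokens)
        return self.norm(x + attended)                   # query-conditioned


class SketchFusion(nn.Module):
    """Hypothetical trainable fusion of several sketch queries: a learned
    pooling token attends over the per-sketch embeddings."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.pool = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, sketch_feats):                     # (B, n_sketches, dim)
        q = self.pool.expand(sketch_feats.size(0), -1, -1)
        fused, _ = self.attn(q, sketch_feats, sketch_feats)
        return fused                                     # (B, 1, dim)


if __name__ == "__main__":
    block, fusion = SketchGuidedEncoderBlock(), SketchFusion()
    img = torch.randn(2, 196, 256)       # 14x14 patch tokens
    sketches = torch.randn(2, 3, 256)    # three sketch queries per image
    query = fusion(sketches)             # fused into a single query token
    out = block(img, query)              # (2, 196, 256)
    print(out.shape)
```

Stacking several such blocks yields image features that are aligned with the sketch query at every depth, which is the property the abstract credits for the seen-class gains and the generalization to unseen categories.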
Related papers
- What Can Human Sketches Do for Object Detection? [127.67444974452411]
Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues.
A sketch-enabled object detection framework detects based on what you sketch -- that "zebra".
We show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR).
In particular, we first perform independent prompting on both the sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders.
arXiv Detail & Related papers (2023-03-27T12:33:23Z)
- Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings [99.9788496281408]
We study how sketches can be used as a weak label to detect salient objects present in an image.
To accomplish this, we introduce a photo-to-sketch generation model that aims to generate sequential sketch coordinates corresponding to a given visual photo.
Extensive experiments support our hypothesis and show that our sketch-based saliency detection model performs competitively with the state of the art.
arXiv Detail & Related papers (2023-03-20T23:46:46Z)
- I Know What You Draw: Learning Grasp Detection Conditioned on a Few Freehand Sketches [74.63313641583602]
We propose a method to generate a potential grasp configuration relevant to the sketch-depicted objects.
Our model is trained and tested end-to-end, making it easy to deploy in real-world applications.
arXiv Detail & Related papers (2022-05-09T04:23:36Z)
- Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild [5.964436882344729]
This work investigates the problem of sketch-guided object localization.
Human sketches are used as queries to localize objects in natural images.
We propose a sketch-conditioned DETR architecture that avoids hard classification.
We experimentally demonstrate that our model and its variants significantly improve over previous state-of-the-art results.
arXiv Detail & Related papers (2021-09-24T10:39:43Z)
- Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z)
- Sketch-Guided Object Localization in Natural Images [16.982683600384277]
We introduce the novel problem of localizing all instances of an object (seen or unseen during training) in a natural image via a sketch query.
We propose a novel cross-modal attention scheme that guides the region proposal network (RPN) to generate object proposals relevant to the sketch query; a minimal code sketch of this idea appears after this list.
Our method is effective with as little as a single sketch query.
arXiv Detail & Related papers (2020-08-14T19:35:56Z)
- Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval [147.24102408745247]
We study a further trait of sketches that has been overlooked to date, that is, they are hierarchical in terms of the levels of detail.
In this paper, we design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match sketch with photo at corresponding hierarchical levels.
arXiv Detail & Related papers (2020-07-29T20:50:25Z)
- Semantically Tied Paired Cycle Consistency for Any-Shot Sketch-based Image Retrieval [55.29233996427243]
Low-shot sketch-based image retrieval is an emerging task in computer vision.
In this paper, we address any-shot, i.e. zero-shot and few-shot, sketch-based image retrieval (SBIR) tasks.
For solving these tasks, we propose a semantically aligned cycle-consistent generative adversarial network (SEM-PCYC).
Our results demonstrate a significant boost in any-shot performance over the state-of-the-art on the extended version of the Sketchy, TU-Berlin and QuickDraw datasets.
arXiv Detail & Related papers (2020-06-20T22:43:53Z)
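
As referenced in the "Sketch-Guided Object Localization in Natural Images" entry above, one way to picture a cross-modal attention scheme guiding an RPN is the following minimal sketch: a global sketch embedding scores each spatial location of the backbone feature map, and the resulting relevance map reweights the features the RPN sees. The scoring rule and modulation here are assumptions for illustration, not that paper's exact design.

```python
# Minimal sketch, under stated assumptions, of cross-modal attention guiding
# an RPN: a global sketch embedding scores every spatial location of the
# backbone feature map, and the resulting relevance map reweights the features
# fed to the proposal head. The scoring rule and modulation are illustrative.
import torch
import torch.nn.functional as F


def sketch_guided_features(img_feats, sketch_vec):
    """img_feats: (B, C, H, W) backbone map; sketch_vec: (B, C) sketch
    embedding. Returns a query-modulated map of the same shape for the RPN."""
    B, C, H, W = img_feats.shape
    flat = img_feats.flatten(2)                              # (B, C, H*W)
    scores = torch.einsum("bc,bcn->bn", sketch_vec, flat) / C ** 0.5
    relevance = F.softmax(scores, dim=-1).view(B, 1, H, W)   # sums to 1
    # Rescale so the average weight is ~1 and sketch-relevant regions
    # are boosted before proposal generation.
    return img_feats * (1.0 + relevance * H * W)
```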