What Can Human Sketches Do for Object Detection?
- URL: http://arxiv.org/abs/2303.15149v2
- Date: Sat, 28 Oct 2023 17:58:15 GMT
- Title: What Can Human Sketches Do for Object Detection?
- Authors: Pinaki Nath Chowdhury and Ayan Kumar Bhunia and Aneeshan Sain and
Subhadeep Koley and Tao Xiang and Yi-Zhe Song
- Abstract summary: Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues.
A sketch-enabled object detection framework detects based on what you sketch -- that ``zebra''.
We show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR).
In particular, we first perform independent prompting on both sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders.
- Score: 127.67444974452411
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sketches are highly expressive, inherently capturing subjective and
fine-grained visual cues. The exploration of such innate properties of human
sketches has, however, been limited to that of image retrieval. In this paper,
for the first time, we cultivate the expressiveness of sketches but for the
fundamental vision task of object detection. The end result is a sketch-enabled
object detection framework that detects based on what \textit{you} sketch --
\textit{that} ``zebra'' (e.g., one that is eating the grass) in a herd of
zebras (instance-aware detection), and only the \textit{part} (e.g., ``head'' of
a ``zebra'') that you desire (part-aware detection). We further dictate that our
model works without (i) knowing which category to expect at testing (zero-shot)
and (ii) requiring additional bounding boxes (as per fully supervised) or
class labels (as per weakly supervised). Instead of devising a model from the
ground up, we show an intuitive synergy between foundation models (e.g., CLIP)
and existing sketch models built for sketch-based image retrieval (SBIR), which
can already elegantly solve the task -- CLIP to provide model generalisation,
and SBIR to bridge the (sketch$\rightarrow$photo) gap. In particular, we first
perform independent prompting on both sketch and photo branches of an SBIR
model to build highly generalisable sketch and photo encoders on the back of
the generalisation ability of CLIP. We then devise a training paradigm to adapt
the learned encoders for object detection, such that the region embeddings of
detected boxes are aligned with the sketch and photo embeddings from SBIR.
Evaluated on standard object detection datasets like PASCAL-VOC and MS-COCO,
our framework outperforms both supervised (SOD) and weakly-supervised (WSOD)
object detectors on zero-shot setups. Project Page:
\url{https://pinakinathc.github.io/sketch-detect}
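To make the two-step recipe above concrete, here is a minimal PyTorch sketch, assuming a generic frozen transformer stands in for the CLIP-initialised SBIR branches; `PromptedEncoder`, the mean pooling, and the InfoNCE-style `alignment_loss` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a generic frozen transformer stands in for the
# CLIP-initialised SBIR branches. Names and hyperparameters are hypothetical;
# the paper's actual prompt design and training objective may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedEncoder(nn.Module):
    """A frozen encoder with a few learnable prompt tokens prepended to its
    input sequence; only the prompts train, preserving CLIP generalisation."""
    def __init__(self, frozen_blocks: nn.Module, n_prompts: int = 8, dim: int = 512):
        super().__init__()
        self.blocks = frozen_blocks
        for p in self.blocks.parameters():
            p.requires_grad = False                # keep pre-trained weights frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, tokens):                     # tokens: (B, N, dim)
        p = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.blocks(torch.cat([p, tokens], dim=1))
        return out.mean(dim=1)                     # pooled joint-space embedding

def alignment_loss(region_emb, sketch_emb, temperature=0.07):
    """InfoNCE-style loss aligning detected-box region embeddings with their
    matching sketch embeddings (one positive pair per row)."""
    region_emb = F.normalize(region_emb, dim=-1)
    sketch_emb = F.normalize(sketch_emb, dim=-1)
    logits = region_emb @ sketch_emb.t() / temperature
    targets = torch.arange(region_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

One such encoder per branch (sketch and photo) gives the generalisable encoders; the detector is then trained so that its region embeddings score highly under `alignment_loss` against the query sketch.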
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows strong performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z)
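As a hedged illustration of the prototype matching summarised in the entry above (not the paper's code): class-reference embeddings are taken as the mean of support features from a large pre-trained model, and each region proposal is scored by cosine similarity. The function names, shapes, and threshold are assumptions.

```python
# Hypothetical sketch of prototype-based few-shot classification of proposals.
import torch
import torch.nn.functional as F

def build_prototypes(support_feats: dict[str, torch.Tensor]):
    """support_feats maps class name -> (K, D) features of K support examples."""
    names = sorted(support_feats)
    protos = torch.stack([F.normalize(support_feats[c].mean(0), dim=-1)
                          for c in names])               # (C, D) class prototypes
    return names, protos

def classify_proposals(region_feats, names, protos, thresh=0.25):
    """region_feats: (R, D) detector proposal embeddings; returns (name, score)."""
    sims = F.normalize(region_feats, dim=-1) @ protos.t()  # (R, C) cosine sims
    scores, idx = sims.max(dim=-1)
    return [(names[i], s.item()) for i, s in zip(idx, scores) if s > thresh]
```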
- Open Vocabulary Semantic Scene Sketch Understanding [5.638866331696071]
We study the underexplored but fundamental vision problem of machine understanding of freehand scene sketches.
We introduce a sketch encoder that produces a semantically aware feature space, which we evaluate on a semantic sketch segmentation task.
Our method outperforms zero-shot CLIP by 37 points in segmentation pixel accuracy, reaching 85.5% on the FS-COCO sketch dataset.
arXiv Detail & Related papers (2023-12-18T19:02:07Z)
- Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings [99.9788496281408]
We study how sketches can be used as a weak label to detect salient objects present in an image.
To accomplish this, we introduce a photo-to-sketch generation model that aims to generate sequential sketch coordinates corresponding to a given visual photo.
Experiments validate our hypothesis and show that our sketch-based saliency detection model is competitive with the state of the art.
arXiv Detail & Related papers (2023-03-20T23:46:46Z)
- Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch [17.63475613154152]
Given a crude hand-drawn sketch of an object, the goal is to localize all instances of the same object on the target image.
This problem is difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap between sketches and natural images.
We propose a sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features.
arXiv Detail & Related papers (2023-03-15T17:26:17Z)
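A minimal sketch of the query-conditioning idea in the entry above, assuming standard PyTorch attention modules; the real architecture, dimensions, and head counts differ.

```python
# Illustrative only: image tokens pass through a self-attention block, then
# cross-attend to sketch (query) tokens after every block of the encoder.
import torch
import torch.nn as nn

class QueryConditionedBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, sketch_tokens):
        x = self.self_block(img_tokens)                  # standard encoder block
        attended, _ = self.cross_attn(x, sketch_tokens, sketch_tokens)
        return self.norm(x + attended)                   # sketch-conditioned feats

class SketchGuidedEncoder(nn.Module):
    def __init__(self, depth=6, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(QueryConditionedBlock(dim) for _ in range(depth))

    def forward(self, img_tokens, sketch_tokens):
        for layer in self.layers:
            img_tokens = layer(img_tokens, sketch_tokens)
        return img_tokens
```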
- Abstracting Sketches through Simple Primitives [53.04827416243121]
Humans show a high level of abstraction in games that require quickly communicating object information.
We propose the Primitive-based Sketch Abstraction task where the goal is to represent sketches using a fixed set of drawing primitives.
Our Primitive-Matching Network (PMN) learns interpretable abstractions of a sketch in a self-supervised manner.
arXiv Detail & Related papers (2022-07-27T14:32:39Z)
- I Know What You Draw: Learning Grasp Detection Conditioned on a Few Freehand Sketches [74.63313641583602]
We propose a method that generates potential grasp configurations for the objects depicted in a sketch.
Our model is trained and tested end-to-end, making it easy to deploy in real-world applications.
arXiv Detail & Related papers (2022-05-09T04:23:36Z)
- Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild [5.964436882344729]
This work investigates the problem of sketch-guided object localization.
Human sketches are used as queries to conduct the object localization in natural images.
We propose a sketch-conditioned DETR architecture that avoids hard classification.
We experimentally demonstrate that our model and its variants significantly improve over previous state-of-the-art results.
arXiv Detail & Related papers (2021-09-24T10:39:43Z)
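One plausible reading of "avoids hard classification" in the entry above, sketched under assumptions: DETR-style decoder outputs are regressed to boxes, and each object query is scored by similarity to the sketch embedding instead of through a fixed class head.

```python
# Hypothetical head, not the authors' code: soft sketch-matching scores
# replace class logits on top of DETR decoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchConditionedHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.box_head = nn.Linear(dim, 4)    # normalised (cx, cy, w, h)

    def forward(self, decoder_out, sketch_emb):
        """decoder_out: (B, Q, D) DETR decoder outputs; sketch_emb: (B, D)."""
        boxes = self.box_head(decoder_out).sigmoid()
        match = torch.einsum('bqd,bd->bq',
                             F.normalize(decoder_out, dim=-1),
                             F.normalize(sketch_emb, dim=-1))
        return boxes, match                  # per-query boxes and soft scores
```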
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class semantics not only to generate the features but also to separate them discriminatively.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
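The entry above can be pictured with a small, assumption-laden generator: class semantics plus noise map to synthetic visual features for unseen classes. The dimensions and architecture are placeholders, not the paper's design.

```python
# Hedged sketch of class-semantics-conditioned feature synthesis.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, sem_dim=300, noise_dim=100, feat_dim=1024):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 2048),
            nn.LeakyReLU(0.2),
            nn.Linear(2048, feat_dim),
            nn.ReLU(),                       # visual features kept non-negative
        )

    def forward(self, semantics):            # semantics: (B, sem_dim)
        z = torch.randn(semantics.size(0), self.noise_dim,
                        device=semantics.device)
        return self.net(torch.cat([semantics, z], dim=1))

# Usage idea: synthesise features for unseen classes from their word vectors,
# then train the detector's classifier on real seen + synthetic unseen features.
```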