Sketch-Guided Object Localization in Natural Images
- URL: http://arxiv.org/abs/2008.06551v1
- Date: Fri, 14 Aug 2020 19:35:56 GMT
- Title: Sketch-Guided Object Localization in Natural Images
- Authors: Aditay Tripathi, Rajath R Dani, Anand Mishra, Anirban Chakraborty
- Abstract summary: We introduce the novel problem of localizing all instances of an object (seen or unseen during training) in a natural image via sketch query.
We propose a novel cross-modal attention scheme that guides the region proposal network (RPN) to generate object proposals relevant to the sketch query.
Our method is effective with as little as a single sketch query.
- Score: 16.982683600384277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the novel problem of localizing all the instances of an object
(seen or unseen during training) in a natural image via sketch query. We refer
to this problem as sketch-guided object localization. This problem is
distinctively different from the traditional sketch-based image retrieval task
where the gallery set often contains images with only one object. The
sketch-guided object localization proves to be more challenging when we
consider the following: (i) the sketches used as queries are abstract
representations with little information on the shape and salient attributes of
the object, (ii) the sketches have significant variability as they are
hand-drawn by a diverse set of untrained human subjects, and (iii) there exists
a domain gap between sketch queries and target natural images as these are
sampled from very different data distributions. To address the problem of
sketch-guided object localization, we propose a novel cross-modal attention
scheme that guides the region proposal network (RPN) to generate object
proposals relevant to the sketch query. These object proposals are later scored
against the query to obtain final localization. Our method is effective with as
little as a single sketch query. Moreover, it also generalizes well to object
categories not seen during training and is effective in localizing multiple
object instances present in the image. Furthermore, we extend our framework to
a multi-query setting using novel feature fusion and attention fusion
strategies introduced in this paper. The localization performance is evaluated
on publicly available object detection benchmarks, viz. MS-COCO and PASCAL-VOC,
with sketch queries obtained from "Quick, Draw!". The proposed method
significantly outperforms related baselines on both single-query and
multi-query localization tasks.
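To make the mechanism concrete, below is a minimal, hypothetical sketch of the cross-modal attention idea described in the abstract: a pooled sketch-query embedding attends over the backbone feature map, and the resulting spatial mask re-weights the features fed to the RPN. The module names, dimensions, and sigmoid gating are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of sketch-guided attention for an RPN (not the authors'
# exact implementation): a pooled sketch embedding produces a spatial attention
# mask over backbone features, biasing the RPN toward query-relevant regions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)                 # sketch-query projection
        self.k_proj = nn.Conv2d(dim, dim, kernel_size=1)  # image-feature projection

    def forward(self, img_feats: torch.Tensor, sketch_emb: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) backbone features; sketch_emb: (B, C)
        q = self.q_proj(sketch_emb)                       # (B, C)
        k = self.k_proj(img_feats)                        # (B, C, H, W)
        attn = torch.einsum('bc,bchw->bhw', q, k)         # query-location similarity
        attn = torch.sigmoid(attn).unsqueeze(1)           # (B, 1, H, W) gate in [0, 1]
        return img_feats * attn                           # attended features for the RPN

# In the multi-query setting, a simple feature-fusion baseline would average
# the sketch embeddings before attention, e.g.:
#   fused = torch.stack([encoder(s) for s in sketches]).mean(dim=0)
# The attended map then feeds a standard RPN, whose proposals are scored
# against the sketch query to produce the final localization.
```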
Related papers
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - Adapt and Align to Improve Zero-Shot Sketch-Based Image Retrieval [85.39613457282107]
The cross-domain nature of sketch-based image retrieval makes it challenging.
We present an effective "Adapt and Align" approach to address the key challenges.
Inspired by recent advances of image-text foundation models (e.g., CLIP) in zero-shot scenarios, we explicitly align the learned image embedding with a more semantic text embedding to achieve the desired knowledge transfer from seen to unseen classes.
arXiv Detail & Related papers (2023-05-09T03:10:15Z) - Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch [17.63475613154152]
Given a crude hand-drawn sketch of an object, the goal is to localize all instances of the same object on the target image.
This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap existing between the sketches and the natural images.
We propose a sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features.
arXiv Detail & Related papers (2023-03-15T17:26:17Z) - Multimodal Query-guided Object Localization [5.424592317916519]
We present a multimodal query-guided object localization approach under the challenging open-set setting.
In particular, we use queries from two modalities, namely, hand-drawn sketch and description of the object.
We present a novel cross-modal attention scheme that guides the region proposal network to generate object proposals.
arXiv Detail & Related papers (2022-12-01T18:35:03Z) - Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing [17.63475613154152]
This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints in a scene graph.
A scene graph is an efficient and structured way to represent all the objects and their semantic relationships in the image.
arXiv Detail & Related papers (2022-11-03T16:46:46Z) - Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild [5.964436882344729]
This work investigates the problem of sketch-guided object localization.
Human sketches are used as queries to conduct the object localization in natural images.
We propose a sketch-conditioned DETR architecture that avoids hard classification.
We experimentally demonstrate that our model and its variants significantly advance over previous state-of-the-art results.
arXiv Detail & Related papers (2021-09-24T10:39:43Z) - Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval [66.37346493506737]
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal retrieval task.
We propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR.
Our approach notably outperforms state-of-the-art methods on both the Sketchy and TU-Berlin datasets.
arXiv Detail & Related papers (2021-06-22T14:58:08Z) - Prototypical Region Proposal Networks for Few-Shot Localization and Classification [1.5100087942838936]
We develop a framework to unify segmentation and classification into an end-to-end classification model, PRoPnet.
We empirically demonstrate that our methods improve accuracy on image datasets with natural scenes containing multiple object classes.
arXiv Detail & Related papers (2021-04-08T04:03:30Z) - A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images.
We present a simple yet surprisingly effective framework to do so.
Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by 50% (and 33%) in relative terms.
arXiv Detail & Related papers (2021-02-17T17:27:21Z) - Cross-Descriptor Visual Localization and Mapping [81.16435356103133]
Visual localization and mapping is the key technology underlying the majority of Mixed Reality and robotics systems.
We present three novel scenarios for localization and mapping which require the continuous update of feature representations.
Our data-driven approach is agnostic to the feature descriptor type, has low computational requirements, and scales linearly with the number of description algorithms.
arXiv Detail & Related papers (2020-12-02T18:19:51Z) - Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
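For reference, a minimal sketch of the standard Fréchet Inception Distance that SceneFID (the last entry above) adapts to multi-object images: FID models two sets of Inception activations as Gaussians and computes the Fréchet distance between them. How SceneFID selects object-centric features is that paper's contribution and is not reproduced here; this is only the base metric.

```python
# Standard Frechet Inception Distance between two Gaussian approximations:
#   d^2 = ||mu_x - mu_y||^2 + Tr(Sx + Sy - 2 (Sx Sy)^{1/2})
# The object-centric feature selection used by SceneFID is not shown.
import numpy as np
from scipy import linalg

def frechet_distance(feats_x: np.ndarray, feats_y: np.ndarray) -> float:
    """feats_*: (N, D) arrays of Inception activations for each image set."""
    mu_x, mu_y = feats_x.mean(axis=0), feats_y.mean(axis=0)
    sigma_x = np.cov(feats_x, rowvar=False)
    sigma_y = np.cov(feats_y, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_y, disp=False)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(sigma_x + sigma_y - 2.0 * covmean))
```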