Open Vocabulary Semantic Scene Sketch Understanding
- URL: http://arxiv.org/abs/2312.12463v2
- Date: Sat, 30 Mar 2024 11:35:52 GMT
- Title: Open Vocabulary Semantic Scene Sketch Understanding
- Authors: Ahmed Bourouis, Judith Ellen Fan, Yulia Gryaditskaya
- Abstract summary: We study the underexplored but fundamental vision problem of machine understanding of freehand scene sketches.
We introduce a sketch encoder that results in a semantically-aware feature space, which we evaluate by testing its performance on a semantic sketch segmentation task.
Our method outperforms zero-shot CLIP by 37 points in segmentation pixel accuracy, reaching an accuracy of $85.5\%$ on the FS-COCO sketch dataset.
- Score: 5.638866331696071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the underexplored but fundamental vision problem of machine understanding of abstract freehand scene sketches. We introduce a sketch encoder that results in a semantically-aware feature space, which we evaluate by testing its performance on a semantic sketch segmentation task. To train our model we rely only on the availability of bitmap sketches with their brief captions and do not require any pixel-level annotations. To obtain generalization to a large set of sketches and categories, we build on a vision transformer encoder pretrained with the CLIP model. We freeze the text encoder and perform visual-prompt tuning of the visual encoder branch while introducing a set of critical modifications. Firstly, we augment the classical key-query (k-q) self-attention blocks with value-value (v-v) self-attention blocks. Central to our model is a two-level hierarchical network design that enables efficient semantic disentanglement: the first level ensures holistic scene sketch encoding, and the second level focuses on individual categories. In the second level of the hierarchy, we then introduce cross-attention between the textual and visual branches. Our method outperforms zero-shot CLIP by 37 points in segmentation pixel accuracy, reaching an accuracy of $85.5\%$ on the FS-COCO sketch dataset. Finally, we conduct a user study that allows us to identify further improvements needed over our method to reconcile machine and human understanding of scene sketches.
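The v-v self-attention modification is the abstract's most concrete architectural detail, so a small illustration may help. Below is a minimal single-head PyTorch sketch; the feature shape, the additive combination of the two paths, and names such as `VVAttention` are our assumptions, and the paper's actual placement of such blocks inside a CLIP-pretrained ViT with visual-prompt tuning is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VVAttention(nn.Module):
    """Minimal single-head sketch of a value-value (v-v) self-attention
    block alongside the classical key-query (k-q) formulation.
    Hypothetical simplification, not the paper's exact block."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Classical k-q attention: queries score against keys.
        kq_attn = F.softmax(q @ k.T * self.scale, dim=-1)
        kq_out = kq_attn @ v
        # v-v attention: values attend to themselves, which tends to
        # reinforce tokens with similar content.
        vv_attn = F.softmax(v @ v.T * self.scale, dim=-1)
        vv_out = vv_attn @ v
        return kq_out + vv_out  # one plausible way to combine both paths

x = torch.randn(197, 768)  # e.g. a ViT-B/16 token sequence
print(VVAttention(768)(x).shape)  # torch.Size([197, 768])
```

One common reading of v-v attention is that it groups tokens by content similarity rather than by learned query-key matching, which plausibly suits pulling together the strokes of one category.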
Related papers
- Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation [0.9208007322096532]
Scene sketch semantic segmentation is a crucial task for various applications including sketch-to-image retrieval and scene understanding.
Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes; the snippet after this entry illustrates what rasterization discards.
We propose a Class-Agnostic Visio-Temporal Network (CAVT) for scene sketch semantic segmentation.
arXiv Detail & Related papers (2024-09-30T22:34:29Z)
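To make the bitmap-versus-vector point concrete, here is a minimal sketch assuming a toy (x, y, t) point format of our own; CAVT's actual input representation is not reproduced.

```python
import numpy as np

# A vector sketch: an ordered list of strokes, each an (N, 3) array of
# (x, y, t) points. Stroke ordering and timestamps carry drawing dynamics.
strokes = [
    np.array([[2, 2, 0.0], [8, 2, 0.4], [8, 8, 0.9]]),  # drawn first
    np.array([[3, 5, 1.5], [6, 5, 1.8]]),               # drawn second
]

# Rasterizing to a bitmap keeps only pixel occupancy: stroke identity,
# drawing order, and timing are all discarded.
bitmap = np.zeros((10, 10), dtype=np.uint8)
for stroke in strokes:
    for x, y, _t in stroke:
        bitmap[int(y), int(x)] = 1

print("points with time info:", sum(len(s) for s in strokes))
print("bitmap pixels set (no order/time):", int(bitmap.sum()))
```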
- Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation [14.998239253285394]
We propose to replace the visual prior representation with CLIP's visual-text alignment capacity to capture more reliable guidance.
We show that our method obtains a substantial improvement and reaches new state-of-the-art performance.
arXiv Detail & Related papers (2024-05-14T09:28:25Z)
- ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention [7.783971241874388]
This paper presents ContextSeg, a simple yet highly effective two-stage approach to sketch semantic segmentation.
In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network; a toy construction of such a target appears after this entry.
In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism.
arXiv Detail & Related papers (2023-11-28T10:53:55Z)
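A dense distance field of the kind regressed in the first stage can be built from a binary sketch image with SciPy. ContextSeg's exact formulation may differ, so the unsigned Euclidean distance used below is an assumption.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy binary sketch image: 1 on stroke pixels, 0 elsewhere.
sketch = np.zeros((64, 64), dtype=np.uint8)
sketch[32, 10:54] = 1  # a horizontal stroke
sketch[10:54, 32] = 1  # a vertical stroke

# Distance field: for every pixel, the Euclidean distance to the nearest
# stroke pixel. An autoencoder can regress this dense map as an auxiliary
# target to better encode stroke shape and position.
distance_field = distance_transform_edt(sketch == 0)

print(distance_field.shape, float(distance_field.max()))
```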
- What Can Human Sketches Do for Object Detection? [127.67444974452411]
Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues.
A sketch-enabled object detection framework detects based on what you sketch -- that "zebra".
We show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR).
In particular, we first perform independent prompting on both the sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders.
arXiv Detail & Related papers (2023-03-27T12:33:23Z)
- Abstracting Sketches through Simple Primitives [53.04827416243121]
Humans show a high level of abstraction capability in games that require quickly communicating object information.
We propose the Primitive-based Sketch Abstraction task where the goal is to represent sketches using a fixed set of drawing primitives.
Our Primitive-Matching Network (PMN) learns interpretable abstractions of a sketch in a self-supervised manner; a toy nearest-primitive match follows this entry.
arXiv Detail & Related papers (2022-07-27T14:32:39Z)
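The toy nearest-primitive match promised above: assign each stroke the drawing primitive with the lowest fitting error after normalization. This is a hypothetical numpy illustration; PMN itself learns the abstraction end to end rather than using this hand-crafted error.

```python
import numpy as np

def resample(points: np.ndarray, n: int = 32) -> np.ndarray:
    """Resample a polyline to n points, uniformly by arc length."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    t = t / t[-1]
    u = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(u, t, points[:, d]) for d in range(2)], axis=1)

def normalize(points: np.ndarray) -> np.ndarray:
    """Center and scale to unit size so matching ignores position/scale."""
    p = points - points.mean(axis=0)
    return p / (np.abs(p).max() + 1e-8)

# A small fixed dictionary of drawing primitives, as dense polylines.
theta = np.linspace(0.0, 2 * np.pi, 64)
primitives = {
    "line": np.stack([np.linspace(-1, 1, 64), np.zeros(64)], axis=1),
    "circle": np.stack([np.cos(theta), np.sin(theta)], axis=1),
}

def best_primitive(stroke: np.ndarray) -> str:
    # Note: this naive error ignores rotation and drawing direction.
    s = normalize(resample(stroke))
    errors = {
        name: np.mean((s - normalize(resample(p))) ** 2)
        for name, p in primitives.items()
    }
    return min(errors, key=errors.get)

stroke = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.0]])
print(best_primitive(stroke))  # "line"
```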
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context [112.07988211268612]
We advance sketch research to scenes with the first dataset of freehand scene sketches, FS-COCO.
Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals.
We study, for the first time, the problem of fine-grained image retrieval from freehand scene sketches and sketch captions.
arXiv Detail & Related papers (2022-03-04T03:00:51Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP; a schematic of the underlying patch-text matching follows this entry.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
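The patch-text matching behind such a CLIP-based baseline can be shown schematically. The snippet uses random stand-in tensors in place of real CLIP features, and omits both dense CLIP feature extraction and the paper's exact decoder, so the shapes and names here are assumptions.

```python
import torch
import torch.nn.functional as F

# Schematic zero-shot segmentation with CLIP-style embeddings.
# In a real pipeline, patch_feats would come from the visual encoder's
# patch tokens and text_feats from prompts like "a photo of a {class}".
num_classes, dim, grid = 5, 512, 14
patch_feats = F.normalize(torch.randn(grid * grid, dim), dim=-1)
text_feats = F.normalize(torch.randn(num_classes, dim), dim=-1)

# Per-patch class scores via cosine similarity, then argmax.
logits = patch_feats @ text_feats.T              # (196, num_classes)
patch_labels = logits.argmax(dim=-1).reshape(grid, grid)

# Upsample the coarse patch labels to pixel resolution (nearest neighbor).
pixel_labels = F.interpolate(
    patch_labels[None, None].float(), size=(224, 224), mode="nearest"
).long()[0, 0]

print(pixel_labels.shape)  # torch.Size([224, 224])
```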
- One Sketch for All: One-Shot Personalized Sketch Segmentation [84.45203849671003]
We present the first one-shot personalized sketch segmentation method.
We aim to segment all sketches belonging to the same category using a single exemplar sketch with a given part annotation; a crude nearest-neighbor stand-in for this label transfer is sketched after this entry.
We preserve the part semantics embedded in the exemplar and are robust to input style and abstraction.
arXiv Detail & Related papers (2021-12-20T20:10:44Z)
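As a crude stand-in for one-shot part-label transfer, one can label each query stroke with the part of its nearest exemplar stroke in some shared feature space. The numpy sketch below is a hypothetical illustration with random stand-in features, not the paper's method.

```python
import numpy as np

def transfer_part_labels(
    exemplar_feats: np.ndarray,   # (n_exemplar_strokes, dim)
    exemplar_labels: np.ndarray,  # (n_exemplar_strokes,) part ids
    query_feats: np.ndarray,      # (n_query_strokes, dim)
) -> np.ndarray:
    """Label each query stroke with the part id of its nearest
    exemplar stroke under cosine similarity."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = unit(query_feats) @ unit(exemplar_feats).T
    return exemplar_labels[sim.argmax(axis=1)]

rng = np.random.default_rng(0)
ex_feats = rng.normal(size=(4, 16))                   # 4 annotated strokes
ex_labels = np.array([0, 0, 1, 2])                    # e.g. body, body, head, tail
q_feats = ex_feats + 0.1 * rng.normal(size=(4, 16))   # similar query strokes
print(transfer_part_labels(ex_feats, ex_labels, q_feats))  # likely [0 0 1 2]
```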