Synthesizing human-like sketches from natural images using a conditional convolutional decoder
- URL: http://arxiv.org/abs/2003.07101v1
- Date: Mon, 16 Mar 2020 10:42:53 GMT
- Title: Synthesizing human-like sketches from natural images using a conditional convolutional decoder
- Authors: Moritz Kampelmühler and Axel Pinz
- Abstract summary: We propose a fully convolutional end-to-end architecture that is able to synthesize human-like sketches of objects in natural images.
We train our structure in an end-to-end supervised fashion on a collection of sketch-image pairs.
The generated sketches of our architecture can be classified with 85.6% Top-5 accuracy and we verify their visual quality via a user study.
- Score: 3.3504365823045035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are able to precisely communicate diverse concepts by employing
sketches, a highly reduced and abstract shape based representation of visual
content. We propose, for the first time, a fully convolutional end-to-end
architecture that is able to synthesize human-like sketches of objects in
natural images with potentially cluttered background. To enable an architecture
to learn this highly abstract mapping, we employ the following key components:
(1) a fully convolutional encoder-decoder structure, (2) a perceptual
similarity loss function operating in an abstract feature space and (3)
conditioning of the decoder on the label of the object that shall be sketched.
Given the combination of these architectural concepts, we can train our
structure in an end-to-end supervised fashion on a collection of sketch-image
pairs. The generated sketches of our architecture can be classified with 85.6%
Top-5 accuracy and we verify their visual quality via a user study. We find
that deep features as a perceptual similarity metric enable image translation
with large domain gaps and our findings further show that convolutional neural
networks trained on image classification tasks implicitly learn to encode shape
information. Code is available under
https://github.com/kampelmuehler/synthesizing_human_like_sketches
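
The abstract names three components: a fully convolutional encoder-decoder, a perceptual similarity loss computed in an abstract feature space, and conditioning of the decoder on the object label. The following is a minimal, hypothetical PyTorch-style sketch of how such pieces might be wired together for end-to-end supervised training on sketch-image pairs. The layer widths, the number of classes, the concatenation-based label conditioning, and the use of VGG-16 features for the perceptual loss are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical sketch (assumed PyTorch): label-conditioned fully convolutional
# encoder-decoder trained with a perceptual similarity loss.
import torch
import torch.nn as nn
import torchvision.models as models


class ConditionalSketchNet(nn.Module):
    def __init__(self, num_classes: int = 125, feat_ch: int = 256):
        super().__init__()
        # Fully convolutional encoder: RGB image -> abstract feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder conditioning: a learned label embedding is broadcast spatially
        # and concatenated with the encoder features (one possible scheme).
        self.label_emb = nn.Embedding(num_classes, feat_ch)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_ch, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 1-channel sketch
        )

    def forward(self, image: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image)                                # (B, C, H', W')
        cond = self.label_emb(label)[:, :, None, None]             # (B, C, 1, 1)
        cond = cond.expand(-1, -1, feats.size(2), feats.size(3))   # broadcast spatially
        return self.decoder(torch.cat([feats, cond], dim=1))


class PerceptualLoss(nn.Module):
    """Similarity measured in the feature space of a frozen image classifier."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False
        self.features = vgg.eval()

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Replicate the single sketch channel to 3 channels for the VGG input.
        return nn.functional.l1_loss(
            self.features(pred.repeat(1, 3, 1, 1)),
            self.features(target.repeat(1, 3, 1, 1)),
        )


# Illustrative end-to-end supervised training step on a sketch-image pair.
model, loss_fn = ConditionalSketchNet(), PerceptualLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
image = torch.randn(2, 3, 256, 256)    # natural images
sketch = torch.rand(2, 1, 256, 256)    # paired human sketches
label = torch.tensor([3, 17])          # object class labels
loss = loss_fn(model(image, label), sketch)
loss.backward()
optimizer.step()
```

Conditioning the decoder on the object label, as the abstract notes, tells the network which object in a potentially cluttered scene should be sketched, while the perceptual loss compares prediction and ground truth in feature space rather than pixel space, which the authors find is what makes translation across such a large domain gap feasible.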
Related papers
- Disentangling Visual Priors: Unsupervised Learning of Scene Interpretations with Compositional Autoencoder [0.20718016474717196]
We propose a neurosymbolic architecture that uses a domain-specific language to capture selected priors of image formation.
We express template programs in that language and learn their parameterization with features extracted from the scene by a convolutional neural network.
When executed, the parameterized program produces geometric primitives which are rendered and assessed for correspondence with the scene content.
arXiv Detail & Related papers (2024-09-15T12:47:39Z)
- Open Vocabulary Semantic Scene Sketch Understanding [5.638866331696071]
We study the underexplored but fundamental vision problem of machine understanding of freehand scene sketches.
We introduce a sketch encoder that results in semantically-aware feature space, which we evaluate by testing its performance on a semantic sketch segmentation task.
Our method outperforms zero-shot CLIP on segmentation pixel accuracy by 37 points, reaching 85.5% on the FS-COCO sketch dataset.
arXiv Detail & Related papers (2023-12-18T19:02:07Z)
- Composer: Creative and Controllable Image Synthesis with Composable Conditions [57.78533372393828]
Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability.
This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity.
arXiv Detail & Related papers (2023-02-20T05:48:41Z)
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
- A Shared Representation for Photorealistic Driving Simulators [83.5985178314263]
We propose to improve the quality of generated images by rethinking the discriminator architecture.
The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses.
We aim to learn a shared latent representation that encodes enough information to jointly perform semantic segmentation, content reconstruction, and coarse-to-fine-grained adversarial reasoning.
arXiv Detail & Related papers (2021-12-09T18:59:21Z)
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z)
- Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans [56.63912568777483]
This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views.
We propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh.
Experiments on ZJU-MoCap show that our approach outperforms prior works by a large margin in terms of novel view synthesis quality.
arXiv Detail & Related papers (2020-12-31T18:55:38Z)
- Deep Generation of Face Images from Sketches [36.146494762987146]
Deep image-to-image translation techniques allow fast generation of face images from freehand sketches.
Existing solutions tend to overfit to sketches, thus requiring professional sketches or even edge maps as input.
We propose to implicitly model the shape space of plausible face images and synthesize a face image in this space to approximate an input sketch.
Our method essentially uses input sketches as soft constraints and is thus able to produce high-quality face images even from rough and/or incomplete sketches.
arXiv Detail & Related papers (2020-06-01T16:20:23Z)
- SketchyCOCO: Image Generation from Freehand Scene Sketches [71.85577739612579]
We introduce the first method for automatic image generation from scene-level freehand sketches.
The key contribution is an attribute vector bridged Generative Adversarial Network called EdgeGAN.
We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution.
arXiv Detail & Related papers (2020-03-05T14:54:10Z)