StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
- URL: http://arxiv.org/abs/2510.20093v1
- Date: Thu, 23 Oct 2025 00:27:32 GMT
- Title: StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
- Authors: Jiho Park, Sieun Choi, Jaeyoon Seo, Jihie Kim
- Abstract summary: We propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. We fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency.
- Score: 4.851573895718146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.
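The abstract describes a reinforcement-learning reward computed from visual question answering but gives no implementation details. The sketch below is only an illustration of how such a reward could be scored: `QAPair`, `vqa_reward`, and the `vqa_model.answer(image, question)` interface are assumptions rather than the authors' API; in practice the scalar returned here would feed a policy-gradient update over the diffusion model's denoising steps.

```python
# A minimal sketch, assuming a hypothetical VQA interface; not the authors' implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class QAPair:
    question: str
    expected_answer: str

def vqa_reward(image, qa_pairs: List[QAPair], vqa_model) -> float:
    """Score a generated sketch by the fraction of prompt-derived
    question-answer pairs a VQA model answers correctly on it."""
    if not qa_pairs:
        return 0.0
    correct = 0
    for qa in qa_pairs:
        # `vqa_model.answer(image, question)` is an assumed interface,
        # standing in for whatever VQA model is plugged into the framework.
        predicted = vqa_model.answer(image, qa.question)
        if predicted.strip().lower() == qa.expected_answer.strip().lower():
            correct += 1
    return correct / len(qa_pairs)
```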
Related papers
- VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation [73.23035143627598]
Most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing.
We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models.
Our method generates high-quality sketches that closely follow text-specified orderings while exhibiting rich visual detail.
arXiv Detail & Related papers (2026-02-17T18:55:03Z) - CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model [18.5540421907361]
Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists.
We propose a novel framework CoProSketch, providing prominent controllability and details for sketch generation with diffusion models.
Experiments demonstrate superior semantic consistency and controllability over baselines, offering a practical solution for integrating user feedback into generative models.
arXiv Detail & Related papers (2025-04-11T05:11:17Z) - SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation [57.47730473674261]
We introduce SwiftSketch, a model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second.
SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution.
ControlSketch is a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet.
arXiv Detail & Related papers (2025-02-12T18:57:12Z) - SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches [4.55322003438174]
SketchFlex is an interactive system designed to improve the flexibility of spatially conditioned image generation.
It infers user prompts with rational descriptions within a semantic space enriched by crowd-sourced object attributes and relationships.
It refines users' rough sketches into canny-based shape anchors, ensuring the generation quality and alignment of user intentions.
arXiv Detail & Related papers (2025-02-11T13:48:11Z) - Training-Free Sketch-Guided Diffusion with Latent Optimization [22.94468603089249]
We propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition.
To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models.
We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process. (A minimal illustrative sketch of one such refinement step appears after this list.)
arXiv Detail & Related papers (2024-08-31T00:44:03Z) - ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z) - Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling [49.41822427811098]
We present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors.
Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables.
We show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
arXiv Detail & Related papers (2024-05-31T17:41:11Z) - VisioBlend: Sketch and Stroke-Guided Denoising Diffusion Probabilistic Model for Realistic Image Generation [0.0]
We propose a unified framework supporting three-dimensional control over image synthesis from sketches and strokes based on diffusion models.
It enables users to decide the level of faithfulness to the input strokes and sketches.
It solves the problem of data availability by synthesizing new data points from hand-drawn sketches and strokes.
arXiv Detail & Related papers (2024-05-15T11:27:27Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - SketchyCOCO: Image Generation from Freehand Scene Sketches [71.85577739612579]
We introduce the first method for automatic image generation from scene-level freehand sketches.
The key contribution is an attribute vector bridged Generative Adversarial Network called EdgeGAN.
We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution.
arXiv Detail & Related papers (2020-03-05T14:54:10Z) - Deep Plastic Surgery: Robust and Controllable Image Editing with Human-Drawn Sketches [133.01690754567252]
Sketch-based image editing aims to synthesize and modify photos based on the structural information provided by the human-drawn sketches.
Deep Plastic Surgery is a novel, robust and controllable image editing framework that allows users to interactively edit images using hand-drawn sketch inputs.
arXiv Detail & Related papers (2020-01-09T08:57:50Z)
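As referenced in the training-free sketch-guided diffusion entry above, that paper reports refining the noisy latent at intermediate denoising steps using the model's cross-attention maps. The snippet below is a minimal PyTorch sketch of what one such refinement step could look like under assumed shapes; `sketch_mask`, the attention-overlap loss, and the step size are illustrative assumptions, not the paper's actual procedure.

```python
# A minimal sketch, under assumptions, of one latent-optimization step that
# pulls a denoising step's cross-attention mass onto the sketch region.
import torch

def latent_optimization_step(latent, attn_maps, sketch_mask, lr=0.05):
    """latent:      noisy latent at the current step, created with requires_grad=True
    attn_maps:   iterable of (H, W) cross-attention maps for the prompt tokens,
                 computed in a graph that still depends on `latent`
    sketch_mask: (H, W) binary mask marking where the input sketch has strokes
    """
    loss = torch.zeros((), device=latent.device)
    for attn in attn_maps:
        attn = attn / (attn.sum() + 1e-8)                 # normalize each attention map
        loss = loss + (1.0 - (attn * sketch_mask).sum())  # penalize attention mass outside the sketch
    (grad,) = torch.autograd.grad(loss, latent)
    with torch.no_grad():
        return latent - lr * grad                         # nudge the latent toward sketch-aligned attention
```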