Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation
- URL: http://arxiv.org/abs/2508.17017v1
- Date: Sat, 23 Aug 2025 13:09:19 GMT
- Title: Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation
- Authors: Konstantina Nikolaidou, George Retsinas, Giorgos Sfikas, Silvia Cascianelli, Rita Cucchiara, Marcus Liwicki
- Abstract summary: Diffusion-based Handwritten Text Generation (HTG) approaches achieve impressive results on frequent, in-vocabulary words observed at training time and on regular styles. However, they are prone to memorizing training samples and often struggle with style variability and generation clarity. We propose a novel sampling guidance strategy, Dual Orthogonal Guidance (DOG), that leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. Experimental results on the state-of-the-art DiffusionPen and One-DM demonstrate that DOG improves both content clarity and style variability, even for out-of-vocabulary words and challenging writing styles.
- Score: 55.35931633405974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based Handwritten Text Generation (HTG) approaches achieve impressive results on frequent, in-vocabulary words observed at training time and on regular styles. However, they are prone to memorizing training samples and often struggle with style variability and generation clarity. In particular, standard diffusion models tend to produce artifacts or distortions that negatively affect the readability of the generated text, especially when the style is hard to produce. To tackle these issues, we propose a novel sampling guidance strategy, Dual Orthogonal Guidance (DOG), that leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. This approach helps steer the generation away from artifacts while maintaining the intended content, and encourages more diverse, yet plausible, outputs. Unlike standard Classifier-Free Guidance (CFG), which relies on unconditional predictions and produces noise at high guidance scales, DOG introduces a more stable, disentangled direction in the latent space. To control the strength of the guidance across the denoising process, we apply a triangular schedule: weak at the start and end of denoising, when the process is most sensitive, and strongest in the middle steps. Experimental results on the state-of-the-art DiffusionPen and One-DM demonstrate that DOG improves both content clarity and style variability, even for out-of-vocabulary words and challenging writing styles.
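The abstract names the ingredients of DOG (an orthogonal projection of the negative-prompt prediction onto the positive one, plus a triangular guidance schedule) but not the exact formula. The following is a minimal sketch of how such a sampling step could look, assuming the denoiser returns paired noise predictions for the positive and the negatively perturbed prompt; the combination rule, sign convention, and function names are illustrative assumptions, not the authors' implementation:

```python
import torch

def triangular_schedule(step: int, num_steps: int, w_max: float) -> float:
    # Guidance weight is near zero at the first and last denoising steps,
    # where the process is most sensitive, and peaks at w_max mid-way.
    progress = step / max(num_steps - 1, 1)
    return w_max * (1.0 - abs(2.0 * progress - 1.0))

def dog_guidance(eps_pos: torch.Tensor, eps_neg: torch.Tensor, w: float) -> torch.Tensor:
    # Project the negative-prompt noise prediction onto the positive one and
    # keep only the orthogonal residual, which (by assumption) isolates
    # artifact directions disentangled from the intended content.
    dims = tuple(range(1, eps_pos.ndim))  # reduce over all non-batch dims
    dot = (eps_neg * eps_pos).sum(dim=dims, keepdim=True)
    norm_sq = (eps_pos * eps_pos).sum(dim=dims, keepdim=True).clamp_min(1e-8)
    eps_orth = eps_neg - (dot / norm_sq) * eps_pos
    # Steer the update away from the artifact component; unlike CFG, no
    # unconditional prediction is involved.
    return eps_pos - w * eps_orth
```

At each denoising step t of T, one would compute `w = triangular_schedule(t, T, w_max)` and substitute `dog_guidance(eps_pos, eps_neg, w)` for the usual CFG combination before the sampler update.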
Related papers
- Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models [11.196851704643406]
Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation.
We show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor.
We propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process.
TAPS is compatible with both non-autoregressive and semi-autoregressive diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
arXiv Detail & Related papers (2026-01-30T06:39:33Z)
- DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation [41.08176249345279]
DiffInk is the first latent diffusion Transformer framework for full-line handwriting generation.
We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses.
We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories.
arXiv Detail & Related papers (2025-09-28T03:58:15Z)
- Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach that produces semantically rich gestures.
We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures.
We propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z)
- The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation [31.599902235859687]
We propose to leverage aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts.
NoiseQuery enables fine-grained control and yields significant performance boosts on both high-level semantics and low-level visual attributes.
arXiv Detail & Related papers (2024-12-06T14:59:00Z)
- Scribble-Guided Diffusion for Training-free Text-to-Image Generation [17.930032337081673]
Scribble-Guided Diffusion (ScribbleDiff) is a training-free approach that utilizes simple user-provided scribbles as visual prompts to guide image generation.
We introduce moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs.
arXiv Detail & Related papers (2024-09-12T13:13:07Z)
- Text Diffusion with Reinforced Conditioning [92.17397504834825]
This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling.
Motivated by our findings, we propose a novel Text Diffusion model called TREC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling.
arXiv Detail & Related papers (2024-02-19T09:24:02Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality, but often struggle when irrelevant references are mixed in.
We present DKGen, which divides text generation into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- KEST: Kernel Distance Based Efficient Self-Training for Improving Controllable Text Generation [24.47531522553703]
We propose KEST, a novel and efficient self-training framework to handle these problems.
KEST utilizes a kernel-based loss, rather than standard cross entropy, to learn from the soft pseudo text produced by a shared non-autoregressive generator.
Experiments on three controllable generation tasks demonstrate that KEST significantly improves control accuracy while maintaining comparable text fluency and generation diversity against several strong baselines.
arXiv Detail & Related papers (2023-06-17T19:40:57Z)
- AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation [138.98095392584693]
We introduce Auto-Regressive Diffusion (AR-Diffusion) to account for the inherent sequential characteristic of natural language.
AR-Diffusion ensures that the generation of tokens on the right depends on those already generated on the left, a mechanism achieved by employing a dynamic number of denoising steps.
In a series of experiments on various text generation tasks, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models.
arXiv Detail & Related papers (2023-05-16T15:10:22Z)
- DuNST: Dual Noisy Self Training for Semi-Supervised Controllable Text Generation [34.49194429157166]
Self-training (ST) has prospered again in language understanding by augmenting the fine-tuning of pre-trained language models when labeled data is insufficient.
It remains challenging to incorporate ST into attribute-controllable language generation.
arXiv Detail & Related papers (2022-12-16T21:44:34Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)