DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
- URL: http://arxiv.org/abs/2602.21824v1
- Date: Wed, 25 Feb 2026 11:52:13 GMT
- Title: DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
- Authors: Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit, Ahmad Sarmad Ali, Muhammad Armaghan Shakir, Adrian Kalwa, Momina Moetesum, Andreas Dengel, Sheraz Ahmed, Faisal Shafait, Ulrich Schwanecke, Adrian Ulges
- Abstract summary: We propose a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs). Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset. We show that our framework achieves on average $87\%$ of the performance of the full real-world dataset.
- Score: 5.342168661302001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting and contextual visual elements via semantic-visual decoupling, we generate diverse, high-quality annotated synthetic documents. We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis. To our knowledge, this is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks. We show that with only 100 real training samples, our framework achieves on average $87\%$ of the performance of the full real-world dataset. We publicly release our code and 140k+ synthetic document samples.
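The abstract's clustering-based seed selection with parametrized sampling can be sketched roughly as follows. This is a hypothetical illustration, not the authors' released code: the document embeddings, the simple k-means routine, and the proportional quota rule are all assumptions standing in for unspecified details of the paper.

```python
# Hypothetical sketch of clustering-based seed selection with
# parametrized sampling: embed unlabeled seed documents, cluster the
# embeddings with a basic k-means, then draw seeds from each cluster in
# proportion to its size so the selected seeds follow the source
# dataset's distribution.
import numpy as np

def select_seeds(embeddings, n_clusters=5, n_seeds=10, iters=20, rng=None):
    """Pick seed document indices via k-means and proportional sampling."""
    rng = rng or np.random.default_rng(0)
    # Initialize centroids from randomly chosen documents.
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = embeddings[labels == k].mean(axis=0)
    # Sample seeds per cluster, quota proportional to cluster size.
    seeds = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        quota = max(1, round(n_seeds * len(members) / len(embeddings)))
        seeds.extend(rng.choice(members, min(quota, len(members)), replace=False))
    return sorted(set(int(i) for i in seeds))[:n_seeds]
```

The proportional quota is what keeps the synthetic set close to the source distribution: dense regions of the embedding space contribute more seeds than outlier clusters.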
Related papers
- FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models [4.013756026582041]
Developing document understanding models at enterprise scale requires large, diverse, and well-annotated datasets. We introduce FlexDoc, a scalable synthetic data generation framework. We show that FlexDoc improves the absolute F1 score by up to 11% when used to augment real datasets.
arXiv Detail & Related papers (2025-10-02T15:42:35Z) - ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links [57.514511353084565]
We introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links. We apply our framework in two distinct domains -- peer review and news. The resulting novel datasets lay the foundation for numerous cross-document tasks like media framing and peer review.
arXiv Detail & Related papers (2025-09-01T11:32:24Z) - DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations [8.612181075294327]
In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning. We produce a synthetic dataset of over $5k$ Wikipedia abstracts with approximately $59k$ entities and $30k$ relation triples. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
arXiv Detail & Related papers (2025-07-08T13:55:25Z) - WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? [64.62909376834601]
This paper introduces WildDoc, the first benchmark designed specifically for assessing document understanding in natural environments. Evaluation of state-of-the-art MLLMs on WildDoc exposes substantial performance declines and underscores the models' inadequate robustness compared to traditional benchmarks.
arXiv Detail & Related papers (2025-05-16T09:09:46Z) - Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
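As a rough illustration of the first idea (a contrastive objective that incorporates document neighbors), the sketch below "contextualizes" each document embedding by its corpus neighbors before a standard in-batch InfoNCE loss. The mean-shift contextualization and all function names are simplifications assumed here for illustration, not the paper's actual method.

```python
# Toy illustration (not the paper's architecture): shift each document
# embedding by the mean of its neighbor embeddings, then score with a
# standard in-batch InfoNCE loss where the matching row is the positive.
import numpy as np

def contextualize(doc_embs, neighbor_embs):
    """Shift each document embedding by the mean of its neighbors.

    doc_embs: (N, D); neighbor_embs: (N, K, D) neighbors per document.
    """
    centered = doc_embs - neighbor_embs.mean(axis=1)
    # L2-normalize so dot products act as cosine similarities.
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def info_nce(queries, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `positives` is the positive for query i."""
    logits = queries @ positives.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matching pairs) as targets.
    return -np.mean(np.diag(log_probs))
```

The intuition is that subtracting the neighborhood mean removes corpus-wide biases, so the contrastive loss focuses on what distinguishes a document from its neighbors rather than from the corpus at large.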
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding [23.910783272007407]
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU).
Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset.
Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies.
arXiv Detail & Related papers (2024-08-27T03:31:24Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - Synthetic Document Generator for Annotation-free Layout Recognition [15.657295650492948]
We describe a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of layout elements.
We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
arXiv Detail & Related papers (2021-11-11T01:58:44Z) - Generating Synthetic Handwritten Historical Documents With OCR Constrained GANs [2.3808546906079178]
We present a framework to generate synthetic historical documents with precise ground truth using nothing more than a collection of unlabeled historical images.
We demonstrate a high-quality synthesis that makes it possible to generate large labeled historical document datasets with precise ground truth.
arXiv Detail & Related papers (2021-03-15T09:39:17Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.