VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
- URL: http://arxiv.org/abs/2602.00122v1
- Date: Tue, 27 Jan 2026 16:51:05 GMT
- Title: VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
- Authors: Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Yixuan Yuan, Tianyu Zong, Xinming Wang, Tao Yu, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani Xinyu Zuo, Jungang Xu
- Abstract summary: Multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content in a flexible and interactive manner. Visual document image editing involves modifying textual content within images while preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts.
- Score: 45.37806172594631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, an important yet insufficiently explored research direction remains visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts, thereby failing to adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose Visual Doc Edit Bench (VDE Bench), a rigorously human-annotated and evaluated benchmark specifically designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset encompassing densely textual documents in both English and Chinese, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification demonstrates a strong consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents.
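The abstract does not release the scoring code, but an OCR-parsing-level, decoupled evaluation of this kind can be illustrated with a short sketch. Everything below is an assumption for illustration, not the authors' implementation: the `ocr` callable, the region arguments, and the character-level similarity metric are hypothetical stand-ins.

```python
from difflib import SequenceMatcher

def text_accuracy(predicted: str, expected: str) -> float:
    """Normalized character-level similarity (1.0 = exact match)."""
    if not expected and not predicted:
        return 1.0
    return SequenceMatcher(None, predicted, expected).ratio()

def evaluate_edit(ocr, edited_img, edit_region, bg_regions,
                  target_text, source_texts):
    """Decoupled, OCR-parsing-level scoring (illustrative sketch only).

    ocr          -- hypothetical callable (image, region) -> recognized string;
                    any OCR engine (e.g. Tesseract) could back it
    edit_region  -- bounding box of the text the model was asked to change
    bg_regions   -- boxes of text that must remain untouched
    target_text  -- the requested replacement string
    source_texts -- original strings in bg_regions, in the same order
    """
    # 1. Edit accuracy: did the requested text appear in the edited region?
    edit_score = text_accuracy(ocr(edited_img, edit_region), target_text)

    # 2. Preservation: did the surrounding text survive the edit?
    pres_scores = [
        text_accuracy(ocr(edited_img, box), src)
        for box, src in zip(bg_regions, source_texts)
    ]
    preservation = sum(pres_scores) / len(pres_scores) if pres_scores else 1.0

    return {"edit_accuracy": edit_score, "preservation": preservation}
```

Decoupling the edit-region score from the preservation score is what makes the assessment fine-grained: a model is penalized separately for failing the requested change and for corrupting surrounding text.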
Related papers
- VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. We introduce VISTA-Bench, a benchmark spanning multimodal perception, reasoning, and unimodal understanding domains.
arXiv Detail & Related papers (2026-02-04T17:48:55Z)
- TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering [18.337757379089037]
We introduce TextEditBench, a comprehensive benchmark for evaluating text editing in text-centric regions of images. Our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We also propose a novel evaluation dimension, Semantic Expectation, which measures a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment.
arXiv Detail & Related papers (2025-12-18T07:37:08Z)
- GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z)
- Visual Text Processing: A Comprehensive Review and Unified Evaluation [99.57846940547171]
We present a comprehensive, multi-perspective analysis of recent advancements in visual text processing. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
arXiv Detail & Related papers (2025-04-30T14:19:29Z)
- TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images. We introduce TextInVision, a large-scale benchmark, driven by text and prompt complexity, that evaluates the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
- IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment [6.627422081288281]
We introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding results. We also introduce IE-QA, a multi-modality, source-aware quality assessment method for text-driven image editing.
arXiv Detail & Related papers (2025-01-17T02:47:25Z)
- Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing [26.086806549826058]
Text-guided image editing seeks to preserve the core elements of the source image while implementing modifications based on the target text. Existing metrics have a context-blindness problem, indiscriminately applying the same evaluation criteria to completely different pairs of source image and target text. We propose AugCLIP, a context-aware metric that adaptively coordinates the preservation and modification aspects.
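The abstract does not spell out how AugCLIP coordinates the two aspects, but the CLIP-space quantities it balances can be sketched. The snippet below is a generic skeleton only, using Hugging Face transformers with an assumed CLIP checkpoint; AugCLIP's adaptive, context-aware weighting is not reproduced here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(source: Image.Image, edited: Image.Image, target_text: str):
    """Generic CLIP-space preservation/modification scores (sketch only,
    not AugCLIP itself)."""
    inputs = processor(text=[target_text], images=[source, edited],
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Normalize so dot products are cosine similarities.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    preservation = (img_emb[0] @ img_emb[1]).item()  # source vs. edited image
    modification = (img_emb[1] @ txt_emb[0]).item()  # edited image vs. target text
    return {"preservation": preservation, "modification": modification}
```

A context-blind metric would combine these two numbers with a fixed weight; the paper's stated contribution is making that trade-off depend on the specific source-image/target-text pair.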
arXiv Detail & Related papers (2024-10-15T08:12:54Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction [27.00018283430169]
This paper presents VisCE², a vision language model-based caption evaluation method.
Our method focuses on visual context, which refers to the detailed content of images, including objects, attributes, and relationships.
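The abstract outlines a two-stage idea: first extract the visual context (objects, attributes, relationships), then evaluate the caption against it. A minimal sketch of that pipeline follows; the `vlm(image, prompt)` helper is hypothetical, standing in for any vision-language model call, and the prompts and 1-to-5 scale are illustrative rather than taken from the paper.

```python
def visual_context_eval(vlm, image, caption: str) -> int:
    """Two-stage, VLM-based caption scoring (illustrative sketch;
    `vlm(image, prompt) -> str` is a hypothetical model wrapper)."""
    # Stage 1: extract the visual context -- objects, attributes, relations.
    context = vlm(image,
                  "List the objects in this image, their attributes, "
                  "and the relationships between them.")

    # Stage 2: judge the candidate caption against the extracted context.
    verdict = vlm(image,
                  f"Visual context:\n{context}\n\n"
                  f"Candidate caption: {caption}\n"
                  "On a scale of 1 to 5, how well does the caption describe "
                  "the image given this context? Answer with a number only.")
    return int(verdict.strip())
```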
arXiv Detail & Related papers (2024-02-28T01:29:36Z)
- Learning the Visualness of Text Using Large Vision-Language Models [42.75864384249245]
Visual text evokes an image in a person's mind, while non-visual text fails to do so.
A method to automatically detect visualness in text will enable text-to-image retrieval and generation models to augment text with relevant images.
We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators.
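As a rough illustration of what such a dataset enables (and not the paper's method, which adapts a large vision-language model), one could regress visualness scores directly from sentence embeddings. The model name, data variables, and train/test split below are assumptions for the sketch.

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def train_visualness_regressor(sentences, scores):
    """Generic baseline: ridge regression on sentence embeddings.

    sentences -- hypothetical list of the annotated English sentences
    scores    -- their human-annotated visualness ratings, same order
    """
    X = encoder.encode(sentences)
    X_tr, X_te, y_tr, y_te = train_test_split(X, scores, test_size=0.2,
                                              random_state=0)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print(f"held-out R^2: {model.score(X_te, y_te):.3f}")
    return model
```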
arXiv Detail & Related papers (2023-05-11T17:45:16Z)