TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
- URL: http://arxiv.org/abs/2512.16270v1
- Date: Thu, 18 Dec 2025 07:37:08 GMT
- Title: TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
- Authors: Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang
- Abstract summary: We introduce TextEditBench, a comprehensive evaluation benchmark for text-centric regions in images. Our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We also propose a novel evaluation dimension, Semantic Expectation, which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment.
- Score: 18.337757379089037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's reasoning ability to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
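The abstract does not detail how Semantic Expectation (SE) is actually scored. Purely as a hedged illustration of the idea, the sketch below combines the three signals the abstract names (semantic consistency, contextual coherence, cross-modal alignment) using off-the-shelf sentence and CLIP embeddings. The model choices, the OCR/caption inputs, the weights, and the function name `semantic_expectation` are all assumptions for illustration, not the benchmark's protocol.

```python
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf encoders (assumed choices, not the benchmark's).
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def semantic_expectation(edited_image: Image.Image,
                         ocr_text: str,       # text read back from the edited region (e.g. via OCR)
                         expected_text: str,  # text the editing instruction asks for
                         scene_caption: str,  # short caption of the surrounding scene
                         weights=(0.4, 0.3, 0.3)) -> float:
    # Semantic consistency: does the rendered text match the requested text?
    consistency = util.cos_sim(
        text_encoder.encode(ocr_text, convert_to_tensor=True),
        text_encoder.encode(expected_text, convert_to_tensor=True),
    ).item()

    # Contextual coherence: does the requested text fit the surrounding scene?
    coherence = util.cos_sim(
        text_encoder.encode(expected_text, convert_to_tensor=True),
        text_encoder.encode(scene_caption, convert_to_tensor=True),
    ).item()

    # Cross-modal alignment: does the edited image agree with a description
    # that combines the scene and the new text?
    inputs = clip_proc(
        text=[f"{scene_caption}, showing the text '{expected_text}'"],
        images=edited_image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    alignment = torch.nn.functional.cosine_similarity(
        out.image_embeds, out.text_embeds).item()

    w_cons, w_coh, w_align = weights
    return w_cons * consistency + w_coh * coherence + w_align * alignment
```

A benchmark of this kind would likely complement such automatic proxies with human or LLM-based judgments, particularly for physical plausibility, which embedding similarity alone cannot capture.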
Related papers
- VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. We introduce VISTA-Bench, a benchmark spanning multimodal perception and reasoning through to unimodal understanding domains.
arXiv Detail & Related papers (2026-02-04T17:48:55Z) - Visual Text Processing: A Comprehensive Review and Unified Evaluation [99.57846940547171]
We present a comprehensive, multi-perspective analysis of recent advancements in visual text processing. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
arXiv Detail & Related papers (2025-04-30T14:19:29Z) - DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images [28.48453375674059]
DanceText is a training-free framework for multilingual text editing in images. It supports complex geometric transformations and achieves seamless foreground-background integration.
arXiv Detail & Related papers (2025-04-18T23:46:32Z) - TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images. We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z) - DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models [115.62816053600085]
We present DesignDiffusion, a framework for synthesizing design images from textual descriptions. The proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt.
arXiv Detail & Related papers (2025-03-03T15:22:57Z) - Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation [17.552733309504486]
In real-world images, slanted or curved texts, especially those on cans, banners, or badges, appear as frequently as flat texts due to artistic design or layout constraints. We introduce a new training-free framework, STGen, which accurately generates visual texts in challenging scenarios.
arXiv Detail & Related papers (2025-01-10T11:44:59Z) - Uncovering the Text Embedding in Text-to-Image Diffusion Models [17.108496821429494]
Text embedding, as the pivotal intermediary between text and images, remains relatively underexplored.
We identify two critical insights regarding the importance of per-word embedding and their contextual correlations within text embedding.
We find that text embedding inherently possesses diverse semantic potentials, and further reveal this property through the lens of singular value decomposition (a minimal SVD sketch appears after this list).
arXiv Detail & Related papers (2024-04-01T14:59:13Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework [19.564048493848272]
Scene Text Editing (STE) is a challenging research problem that primarily aims to modify existing text in an image.
Existing style-transfer-based approaches have shown sub-par editing performance due to complex image backgrounds, diverse font attributes, and varying word lengths within the text.
We propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations.
arXiv Detail & Related papers (2023-08-05T15:54:06Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)