STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data
- URL: http://arxiv.org/abs/2511.09977v2
- Date: Fri, 14 Nov 2025 03:17:04 GMT
- Title: STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data
- Authors: Yongdeuk Seo, Hyun-seok Min, Sungchul Choi
- Abstract summary: We propose STELLAR, a Scene Text Editor for Low-resource LAnguages and Real-world data. STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
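The abstract describes TAS as combining three independently measured similarities (font, color, background). The paper's exact formulation is not given here; the sketch below only illustrates the general idea of aggregating per-aspect similarity scores into a single metric, with the function name, weights, and equal-weighted average all being assumptions.

```python
import numpy as np

def text_appearance_similarity(font_sim, color_sim, bg_sim,
                               weights=(1 / 3, 1 / 3, 1 / 3)):
    """Aggregate three style-similarity components (each assumed in [0, 1])
    into a single score. This is an illustrative sketch, not the paper's
    actual TAS definition; the weighting scheme is a placeholder."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray([font_sim, color_sim, bg_sim], dtype=float)
    # Weighted average of the components, normalized by the weight sum.
    return float(np.dot(w, s) / w.sum())

# Example: equal weighting of hypothetical component scores.
tas = text_appearance_similarity(0.9, 0.8, 0.7)
```

Because each component is measured independently, such a metric can be computed between an edited image and the source image's style, without requiring a ground-truth edited image.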
Related papers
- TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment [68.91073792449201]
We propose TextGuider, a training-free method that encourages accurate and complete text appearance. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer (MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
arXiv Detail & Related papers (2025-12-10T06:18:30Z)
- Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach [51.95266411355865]
Autoregressive language models are vulnerable to orthographic attacks. This vulnerability stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. We propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images.
arXiv Detail & Related papers (2025-08-28T20:48:38Z)
- TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition [19.566553192778525]
Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient real-world data. We introduce TextSSR, a novel pipeline for Synthesizing Scene Text Recognition training data. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background.
arXiv Detail & Related papers (2024-12-02T05:26:25Z)
- Text Image Generation for Low-Resource Languages with Dual Translation Learning [0.0]
This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages.
The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images.
To enhance the accuracy and variety of generated text images, we introduce two guidance techniques.
arXiv Detail & Related papers (2024-09-26T11:23:59Z)
- LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model [20.007650672107566]
Video text spotting (VTS) aims to simultaneously localize, recognize and track text instances in videos.
Recent methods track the zero-shot results of state-of-the-art image text spotters directly.
Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements.
arXiv Detail & Related papers (2024-05-29T15:35:09Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
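The entry above describes replacing one-hot encodings with distributions derived from a large text corpus. The paper's actual construction is not detailed here; the following is only a generic sketch of building smoothed character-frequency targets from a corpus, with the function name, smoothing scheme, and alphabet all being illustrative assumptions.

```python
from collections import Counter

def soft_char_targets(corpus_words, alphabet, smoothing=1.0):
    """Build a smoothed per-character frequency distribution from a corpus,
    usable as a soft target in place of a one-hot label. Illustrative sketch
    only; not the cited paper's exact formulation."""
    # Count character occurrences across all corpus words.
    counts = Counter(ch for w in corpus_words for ch in w if ch in alphabet)
    # Additive (Laplace) smoothing so unseen characters keep nonzero mass.
    total = sum(counts[ch] + smoothing for ch in alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

dist = soft_char_targets(["stop", "shop", "sign"],
                         alphabet="abcdefghijklmnopqrstuvwxyz")
```

Compared with a one-hot target, such a distribution encodes which characters are plausible in scene text at all, which is the kind of linguistic prior the entry says removes the need for in-domain fine-tuning.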
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Stroke-Based Scene Text Erasing Using Synthetic Data [0.0]
Scene text erasing can replace text regions with reasonable content in natural images.
The lack of a large-scale real-world scene-text removal dataset prevents existing methods from performing at full strength.
We enhance and make full use of the synthetic text and consequently train our model only on the dataset generated by the improved synthetic text engine.
This model can partially erase text instances in a scene image with a bounding box provided or work with an existing scene text detector for automatic scene text erasing.
arXiv Detail & Related papers (2021-04-23T09:29:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.