Related papers: Text Image Generation for Low-Resource Languages with Dual Translation Learning

URL: http://arxiv.org/abs/2409.17747v1
Date: Thu, 26 Sep 2024 11:23:59 GMT
Title: Text Image Generation for Low-Resource Languages with Dual Translation Learning
Authors: Chihiro Noguchi, Shun Fukuda, Shoichiro Mihara, Masao Yamanaka
Abstract summary: This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images. To enhance the accuracy and variety of generated text images, we introduce two guidance techniques.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: "synthetic" and "real." The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model's explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.

Related papers

STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data [3.622341086373503]
We propose STELLAR, a Scene Text Editor for Low-resource LAnguages and Real-world data. STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation.
arXiv Detail & Related papers (2025-11-13T05:18:03Z)
Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score [4.8677910801584385]
Large-scale text-to-image generative models have shown remarkable ability to synthesize diverse and high-quality images. We present Dual Contrastive Denoising Score, a framework that leverages the rich generative prior of text-to-image diffusion models. Our method achieves both flexible content modification and structure preservation between input and output images, as well as zero-shot image-to-image translation.
arXiv Detail & Related papers (2025-08-18T08:30:07Z)
EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [31.31018600797305]
We propose a prompt inversion technique called EDITOR for text-to-image diffusion models. Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z)
Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training [68.41837295318152]
Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese text. We propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese.
arXiv Detail & Related papers (2024-10-06T10:25:39Z)
Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild. The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures. We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
Weakly Supervised Scene Text Generation for Low-resource Languages [19.243705770491577]
A large number of annotated training images is crucial for training successful scene text recognition models. Existing scene text generation methods typically rely on a large amount of paired data, which is difficult to obtain for low-resource languages. We propose a novel weakly supervised scene text generation method that leverages a few recognition-level labels as weak supervision.
arXiv Detail & Related papers (2023-06-25T15:26:06Z)
Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain. We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation. Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images. A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding. We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.