Exploring In-Image Machine Translation with Real-World Background
- URL: http://arxiv.org/abs/2505.15282v1
- Date: Wed, 21 May 2025 09:02:53 GMT
- Title: Exploring In-Image Machine Translation with Real-World Background
- Authors: Yanzhi Tian, Zeming Liu, Zhengyang Liu, Yuhang Guo
- Abstract summary: In-Image Machine Translation aims to translate texts within images from one language to another. We propose the DebackX model, which separates the background and text-image from the source image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
- Score: 5.839694459794486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios, such as images of one-line text in black font on white backgrounds, which are far from reality and impractical for real-world applications. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research on complex-scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world backgrounds. However, previous IIMT models perform inadequately in complex scenarios. To address this issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on the text-image directly, and fuses the translated text-image with the background to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
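The abstract describes DebackX as a three-stage pipeline: decompose the source image into background and text-image, translate the text-image directly, then fuse the result back onto the background. A minimal sketch of that control flow, with placeholder stand-ins (the function names, the dict-based image representation, and the lookup-table "translator" are illustrative assumptions, not the authors' code):

```python
# Hypothetical sketch of the three-stage pipeline described in the abstract.
# All internals are placeholders: the real model uses learned networks.

def decompose(source_image):
    """Split the source image into a background layer and a text-image layer."""
    # Stand-in for the learned separation network: the toy "image" is a dict.
    return source_image["background"], source_image["text_image"]

def translate_text_image(text_image, src_lang, tgt_lang):
    """Translate the text rendered in the text-image directly."""
    # Stand-in for the image-to-image translation model: a lookup table.
    lookup = {("en", "de", "Hello"): "Hallo"}
    return lookup.get((src_lang, tgt_lang, text_image), text_image)

def fuse(background, translated_text_image):
    """Recombine the translated text-image with the original background."""
    return {"background": background, "text_image": translated_text_image}

def debackx(source_image, src_lang="en", tgt_lang="de"):
    background, text_image = decompose(source_image)
    translated = translate_text_image(text_image, src_lang, tgt_lang)
    return fuse(background, translated)
```

The point of the decomposition is that the translator never sees the background, so translation quality and background fidelity can be optimized separately before the fusion step recombines them.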
Related papers
- SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild [55.619708995575785]
The text in natural scene images needs to meet four key criteria. The generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. The generated images have superior utility in OCR tasks like text detection and text recognition.
arXiv Detail & Related papers (2025-01-06T12:09:08Z)
- Text Image Generation for Low-Resource Languages with Dual Translation Learning [0.0]
This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages.
The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images.
To enhance the accuracy and variety of generated text images, we introduce two guidance techniques.
arXiv Detail & Related papers (2024-09-26T11:23:59Z)
- AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task of Translate AnyText in the Image (TATI).
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network for MOdifying Scene Text images at the strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- A Scene-Text Synthesis Engine Achieved Through Learning from Decomposed Real-World Data [4.096453902709292]
Scene-text image synthesis techniques aim to naturally compose text instances on background scene images.
We propose a Learning-Based Text Synthesis engine (LBTS) that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet).
After training, those networks can be integrated and utilized to generate the synthetic dataset for scene text analysis tasks.
arXiv Detail & Related papers (2022-09-06T11:15:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.