PRIM: Towards Practical In-Image Multilingual Machine Translation
- URL: http://arxiv.org/abs/2509.05146v1
- Date: Fri, 05 Sep 2025 14:38:07 GMT
- Title: PRIM: Towards Practical In-Image Multilingual Machine Translation
- Authors: Yanzhi Tian, Zeming Liu, Zhengyang Liu, Chong Feng, Xin Li, Heyan Huang, Yuhang Guo
- Abstract summary: In-Image Machine Translation (IIMT) aims to translate images containing text from one language to another. Current research on end-to-end IIMT is conducted on synthetic data, with simple backgrounds, a single font, fixed text positions, and bilingual translation. We propose an end-to-end model, VisTrans, to handle the challenges of practical conditions in PRIM.
- Score: 48.357528732061105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-Image Machine Translation (IIMT) aims to translate images containing text from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data, with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose an end-to-end model, VisTrans, to handle the challenges of practical conditions in PRIM. It processes the visual text and the background information in the image separately, ensuring the capability of multilingual translation while improving visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effect than other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.
Related papers
- PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models [32.38746546500033]
Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. We extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation. We construct the PATIMT benchmark (PATIMTBench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario.
arXiv Detail & Related papers (2025-09-14T08:33:23Z)
- Exploring In-Image Machine Translation with Real-World Background [5.839694459794486]
In-Image Machine Translation aims to translate texts within images from one language to another. We propose the DebackX model, which separates the background and text-image from the source image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
arXiv Detail & Related papers (2025-05-21T09:02:53Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [74.22413337117617]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
- AnyTrans: Translate AnyText in the Image with Large Scale Models [88.5887934499388]
This paper introduces AnyTrans, an all-encompassing framework for the task of Translate AnyText in the Image (TATI).
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
arXiv Detail & Related papers (2024-06-17T11:37:48Z)
- Exploring the Necessity of Visual Modality in Multimodal Machine Translation using Authentic Datasets [3.54128607634285]
We study the impact of the visual modality on translation efficacy by leveraging real-world translation datasets.
We find that the visual modality proves advantageous for the majority of authentic translation datasets.
Our results suggest that visual information serves a supplementary role in multimodal translation and can be substituted.
arXiv Detail & Related papers (2024-04-09T08:19:10Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by a significant BLEU margin on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)