AnyTrans: Translate AnyText in the Image with Large Scale Models
- URL: http://arxiv.org/abs/2406.11432v1
- Date: Mon, 17 Jun 2024 11:37:48 GMT
- Title: AnyTrans: Translate AnyText in the Image with Large Scale Models
- Authors: Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji,
- Abstract summary: This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI)
Our framework incorporates contextual cues from both textual and visual elements during translation.
We have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
- Score: 88.5887934499388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
Related papers
- Ensuring Consistency for In-Image Translation [47.1986912570945]
The in-image machine translation task involves translating text embedded within images, with the translated results presented in image format.
We propose the need to uphold two types of consistency in this task: translation consistency and image generation consistency.
We introduce a novel two-stage framework named HCIIT which involves text-image translation using a multimodal multilingual large language model in the first stage and image backfilling with a diffusion model in the second stage.
arXiv Detail & Related papers (2024-12-24T03:50:03Z) - LMFusion: Adapting Pretrained Language Models for Multimodal Generation [81.78257799283777]
We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities.
Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that, LMFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs.
arXiv Detail & Related papers (2024-12-19T18:56:24Z) - Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins demonstrate superior results than the existing methods on all tasks, each containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z) - Towards Visual Text Design Transfer Across Languages [49.78504488452978]
We introduce a novel task of Multimodal Style Translation (MuST-Bench)
MuST-Bench is a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems.
In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions.
arXiv Detail & Related papers (2024-10-24T15:15:01Z) - ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named, ARTIST, which incorporates a dedicated textual diffusion model to focus on the learning of text structures specifically.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - AnyText: Multilingual Visual Text Generation And Editing [18.811943975513483]
We introduce AnyText, a diffusion-based multilingual visual text generation and editing model.
AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation.
We contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages.
arXiv Detail & Related papers (2023-11-06T12:10:43Z) - De-Diffusion Makes Text a Strong Cross-Modal Interface [33.90004746543745]
We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
Experiments validate the precision and comprehensiveness of De-Diffusion text representing images.
A single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools.
arXiv Detail & Related papers (2023-11-01T16:12:40Z) - Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z) - Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z) - Improving End-to-End Text Image Translation From the Auxiliary Text
Translation Task [26.046624228278528]
We propose a novel text translation enhanced text image translation, which trains the end-to-end model with text translation as an auxiliary task.
By sharing model parameters and multi-task training, our model is able to take full advantage of easily-available large-scale text parallel corpus.
arXiv Detail & Related papers (2022-10-08T02:35:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.